Short Text Classification using Contextual Analysis

Sami Hamid Al Sulaimani; Andrew Starkey

doi:10.1109/ACCESS.2021.3125768

Sami Hamid Al Sulaimani^* (Corresponding Author), Andrew Starkey

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Microblogging tools provide a real-time service for the public to express opinions, to broadcast news and information, and offer an opportunity to comment on and respond to such output. Word usage in social media is continually evolving. Microbloggers may use different sets of words to describe a specific event, and they may use new words (i.e. words that exist neither in the training dataset nor in informal or formal dictionaries) or use words in new contexts. Dynamically capturing new words and their potential meaning from their context can help to reflect word relationships in social media, which can then be useful for solving various problems, such as the event classification task. Different approaches have been proposed in this regard; one of them is Contextual Analysis. This paper focuses on examining the potential of this approach for grouping short texts (tweets) talking about the same event into the same category. A new transparent method for multi-class text categorization is presented. It uses the Contextual Analysis approach to capture the most important words in the context of an event and to detect the usage of similar words in different contexts. In order to test the efficacy in these areas, this study evaluates the performance of the proposed method and other well-known methods, such as Naïve Bayes, Support Vector Machines, K-Nearest Neighbors and Convolutional Neural Networks. On average, the experiments’ results show that the proposed multi-class classification method can effectively categorize tweets into various event groups, with a high f1-measure score of f1>97.09% and f1>95.27% in the imbalanced classes and high number of classes experiments, respectively. However, similar to the baseline methods, the performance is negatively influenced by the imbalanced dataset. The Convolutional Neural Networks method produces the best performance among the other algorithms with f1>97.74% in all experiments, which is 1.73% and 2.72% higher than the lowest performance of Naïve Bayes and K-Nearest Neighbors, respectively, but does not meet the requirements of transparency of results.
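
As an illustration of the f1-measure reported above, the following is a minimal sketch of per-class and macro-averaged F1 for a multi-class task. The abstract does not state which averaging scheme the authors used, so macro averaging is an assumption here, and the function names and event labels (`quake`, `flood`, `election`) are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this class
        else:
            fp[pred] += 1       # predicted class gains a false positive
            fn[truth] += 1      # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical event labels for six tweets; one "quake" tweet is misclassified.
y_true = ["quake", "quake", "quake", "flood", "flood", "election"]
y_pred = ["quake", "quake", "flood", "flood", "flood", "election"]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro averaging weights every class equally, so a poorly classified minority class lowers the score noticeably. That is one reason a score can drop on imbalanced datasets, although the paper itself should be consulted for the exact evaluation protocol.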

Original language	English
Pages (from-to)	149619 - 149629
Number of pages	11
Journal	IEEE Access
Volume	9
Early online date	8 Nov 2021
DOIs	https://doi.org/10.1109/ACCESS.2021.3125768
Publication status	Published - 11 Nov 2021

Keywords

Text analysis
event classification
contextual analysis
supervised machine learning

Access to Document

10.1109/ACCESS.2021.3125768, Licence: CC BY-NC-ND

Sulaimani_etal_IEEEA_Short_Text_Classification_VoR
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
Final published version, 1.39 MB, Licence: CC BY-NC-ND

Cite this

@article{11d33a57c38242c8a90bd00ca0e13796,

title = "Short Text Classification using Contextual Analysis",

abstract = "Microblogging tools provide a real-time service for the public to express opinions, to broadcast news and information, and offer an opportunity to comment on and respond to such output. Word usage in social media is continually evolving. Microbloggers may use different sets of words to describe a specific event, and they may use new words (i.e. words that exist neither in the training dataset nor in informal or formal dictionaries) or use words in new contexts. Dynamically capturing new words and their potential meaning from their context can help to reflect word relationships in social media, which can then be useful for solving various problems, such as the event classification task. Different approaches have been proposed in this regard; one of them is Contextual Analysis. This paper focuses on examining the potential of this approach for grouping short texts (tweets) talking about the same event into the same category. A new transparent method for multi-class text categorization is presented. It uses the Contextual Analysis approach to capture the most important words in the context of an event and to detect the usage of similar words in different contexts. In order to test the efficacy in these areas, this study evaluates the performance of the proposed method and other well-known methods, such as Na{\"i}ve Bayes, Support Vector Machines, K-Nearest Neighbors and Convolutional Neural Networks. On average, the experiments{\textquoteright} results show that the proposed multi-class classification method can effectively categorize tweets into various event groups, with a high f1-measure score of f1>97.09% and f1>95.27% in the imbalanced classes and high number of classes experiments, respectively. However, similar to the baseline methods, the performance is negatively influenced by the imbalanced dataset. The Convolutional Neural Networks method produces the best performance among the other algorithms with f1>97.74% in all experiments, which is 1.73% and 2.72% higher than the lowest performance of Na{\"i}ve Bayes and K-Nearest Neighbors, respectively, but does not meet the requirements of transparency of results.",

keywords = "Text analysis, event classification, contextual analysis, supervised machine learning",

author = "{Al Sulaimani}, {Sami Hamid} and Andrew Starkey",

year = "2021",

month = nov,

day = "11",

doi = "10.1109/ACCESS.2021.3125768",

language = "English",

volume = "9",

pages = "149619 -- 149629",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "IEEE",

}

TY - JOUR

T1 - Short Text Classification using Contextual Analysis

AU - Al Sulaimani, Sami Hamid

AU - Starkey, Andrew

PY - 2021/11/11

Y1 - 2021/11/11

N2 - Microblogging tools provide a real-time service for the public to express opinions, to broadcast news and information, and offer an opportunity to comment on and respond to such output. Word usage in social media is continually evolving. Microbloggers may use different sets of words to describe a specific event, and they may use new words (i.e. words that exist neither in the training dataset nor in informal or formal dictionaries) or use words in new contexts. Dynamically capturing new words and their potential meaning from their context can help to reflect word relationships in social media, which can then be useful for solving various problems, such as the event classification task. Different approaches have been proposed in this regard; one of them is Contextual Analysis. This paper focuses on examining the potential of this approach for grouping short texts (tweets) talking about the same event into the same category. A new transparent method for multi-class text categorization is presented. It uses the Contextual Analysis approach to capture the most important words in the context of an event and to detect the usage of similar words in different contexts. In order to test the efficacy in these areas, this study evaluates the performance of the proposed method and other well-known methods, such as Naïve Bayes, Support Vector Machines, K-Nearest Neighbors and Convolutional Neural Networks. On average, the experiments’ results show that the proposed multi-class classification method can effectively categorize tweets into various event groups, with a high f1-measure score of f1>97.09% and f1>95.27% in the imbalanced classes and high number of classes experiments, respectively. However, similar to the baseline methods, the performance is negatively influenced by the imbalanced dataset. The Convolutional Neural Networks method produces the best performance among the other algorithms with f1>97.74% in all experiments, which is 1.73% and 2.72% higher than the lowest performance of Naïve Bayes and K-Nearest Neighbors, respectively, but does not meet the requirements of transparency of results.

AB - Microblogging tools provide a real-time service for the public to express opinions, to broadcast news and information, and offer an opportunity to comment on and respond to such output. Word usage in social media is continually evolving. Microbloggers may use different sets of words to describe a specific event, and they may use new words (i.e. words that exist neither in the training dataset nor in informal or formal dictionaries) or use words in new contexts. Dynamically capturing new words and their potential meaning from their context can help to reflect word relationships in social media, which can then be useful for solving various problems, such as the event classification task. Different approaches have been proposed in this regard; one of them is Contextual Analysis. This paper focuses on examining the potential of this approach for grouping short texts (tweets) talking about the same event into the same category. A new transparent method for multi-class text categorization is presented. It uses the Contextual Analysis approach to capture the most important words in the context of an event and to detect the usage of similar words in different contexts. In order to test the efficacy in these areas, this study evaluates the performance of the proposed method and other well-known methods, such as Naïve Bayes, Support Vector Machines, K-Nearest Neighbors and Convolutional Neural Networks. On average, the experiments’ results show that the proposed multi-class classification method can effectively categorize tweets into various event groups, with a high f1-measure score of f1>97.09% and f1>95.27% in the imbalanced classes and high number of classes experiments, respectively. However, similar to the baseline methods, the performance is negatively influenced by the imbalanced dataset. The Convolutional Neural Networks method produces the best performance among the other algorithms with f1>97.74% in all experiments, which is 1.73% and 2.72% higher than the lowest performance of Naïve Bayes and K-Nearest Neighbors, respectively, but does not meet the requirements of transparency of results.

KW - Text analysis

KW - event classification

KW - contextual analysis

KW - supervised machine learning

U2 - 10.1109/ACCESS.2021.3125768

DO - 10.1109/ACCESS.2021.3125768

M3 - Article

SN - 2169-3536

VL - 9

SP - 149619

EP - 149629

JO - IEEE Access

JF - IEEE Access

ER -
