Effective Text Data Preprocessing Technique for Sentiment Analysis in Social Media Data

2019 11th International Conference on Knowledge and Systems Engineering (KSE) | Pub Date: 2019-10-01 | DOI: 10.1109/KSE.2019.8919368

Saurav Pradha, M. Halgamuge, N. Q. Vinh


Citations: 48

Abstract

In the big data era, data is generated in real time or near real time, and businesses can use this ever-growing volume of data for data-driven or information-driven decision-making. Social media platforms such as Twitter generate an enormous amount of such data. However, social media data are often unstructured and difficult to manage. Hence, this study proposes an effective text data preprocessing technique and develops an algorithm to train Support Vector Machine (SVM), Deep Learning (DL), and Naïve Bayes (NB) classifiers to process Twitter data. The algorithm weights the sentiment score according to the weight of the hashtag and of the cleaned text. In this study, we (i) compare different preprocessing techniques (stemming, lemmatization, and spelling correction) on data collected from Twitter to identify the most efficient method, and (ii) develop an algorithm that weights the scores of the hashtag and the cleaned text to obtain the sentiment. We retrieved N = 1,314,000 Twitter data items and compared the popularity of two products, Google Now and Amazon Alexa. Using our data preprocessing algorithm and sentiment weight score algorithm, we train SVM, DL, and NB models. The results show that stemming performed best in terms of computational speed. Additionally, the accuracy of the algorithm was tested against manually sorted sentiments and against sentiments produced before text data preprocessing. The results demonstrate that the sentiments produced by the algorithm were close to the manually annotated sentiments. In terms of model performance, the SVM performed better than the other classifiers, with an accuracy of 90.3%, perhaps due to the unstructured nature of Twitter data. Previous studies used conventional techniques and did not apply precise methods for cleaning the text. Our approach therefore confirms that proper text data preprocessing plays a significant role in the prediction accuracy and computational time of a classifier applied to unstructured Twitter data.
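
The abstract does not give the cleaning steps or the weighting formula, so the following is only a minimal sketch of the general idea: separate hashtags from the cleaned tweet text, stem the tokens, score each part, and combine the two scores with a weight. Everything named here is an assumption made for illustration: the tiny `LEXICON`, the crude `toy_stem` suffix stripper (a stand-in for a real stemmer, not the one used in the paper), the function names, and the `hashtag_weight` default of 0.4.

```python
import re

# Hypothetical toy lexicon of stemmed words; the paper's sentiment
# resource is not described in the abstract.
LEXICON = {"great": 1.0, "help": 0.5, "annoy": -0.5, "crash": -1.0, "fail": -1.0}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"[a-z]+")


def split_tweet(tweet):
    """Return (hashtags, cleaned_tokens) for one tweet."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @mentions
    hashtags = HASHTAG_RE.findall(text)
    text = HASHTAG_RE.sub(" ", text)   # keep hashtags out of the body text
    return hashtags, TOKEN_RE.findall(text)


def toy_stem(token):
    """Crude suffix stripping; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lexicon_score(tokens):
    """Mean polarity of the stemmed tokens found in LEXICON, else 0.0."""
    stems = (toy_stem(tok) for tok in tokens)
    hits = [LEXICON[s] for s in stems if s in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def weighted_sentiment(tweet, hashtag_weight=0.4):
    """Combine hashtag and cleaned-text scores (the weight is an assumption)."""
    hashtags, tokens = split_tweet(tweet)
    text_score = lexicon_score(tokens)
    if not hashtags:
        # Nothing to weight against, so fall back to the text score alone.
        return text_score
    tag_score = lexicon_score(hashtags)
    return hashtag_weight * tag_score + (1.0 - hashtag_weight) * text_score


if __name__ == "__main__":
    for tweet in (
        "Alexa helped me a lot today #great https://t.co/abc @amazon",
        "Google Now crashes constantly",
    ):
        print(round(weighted_sentiment(tweet), 3), tweet)
```

In a pipeline like the one the abstract describes, scores of this kind could serve as labels or features for the SVM, DL, and NB classifiers. Swapping `toy_stem` for a lemmatizer or a spelling corrector is the kind of comparison the study reports, with stemming found to be the fastest.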
