A TfidfVectorizer and SVM based sentiment analysis framework for text data corpus

2020 National Conference on Communications (NCC). Pub Date: 2020-02-01. DOI: 10.1109/NCC48643.2020.9056085

Vipin Kumar, Basant Subba


Citations: 28

Abstract

E-commerce and social networking sites depend heavily on available data that can be analyzed in real time to inform their future business strategies. However, analyzing huge amounts of data manually is not feasible within the time constraints of a business. Automated sentiment analysis, which determines the sentiment expressed in a text corpus without human intervention, therefore plays an important role today. Many sentiment analysis frameworks with state-of-the-art results have been proposed in the literature. However, many of these frameworks have low accuracy on text corpora that contain emoticons and special characters. In addition, many of them are energy and computation intensive, which limits their real-time deployment. In this paper, we aim to address these issues by proposing a novel sentiment analysis framework based on the Support Vector Machine (SVM). The proposed framework uses a novel technique to tokenize the text documents, in which stop words, special characters, and emoticons present in the documents are eliminated. In addition, words with similar meanings and annotations are grouped into a single type using stemming. The pre-processed, tokenized documents are then vectorized into n-gram feature vectors using the 'TfidfVectorizer' for use as input to the SVM-based machine learning classifier. Experimental results on Amazon's electronics item review dataset and IMDB's movie review corpus show that the proposed sentiment analysis framework achieves high performance compared to other similar frameworks proposed in the literature.
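The pipeline described in the abstract (cleaning, stemming, TF-IDF n-gram vectorization, SVM classification) can be sketched roughly as follows. This is a minimal illustration using scikit-learn's `TfidfVectorizer` and `LinearSVC`, not the authors' implementation. The abstract does not specify the stop word list, the emoticon rules, the stemmer, the n-gram range, or the SVM kernel, so the small `STOP_WORDS` set, the `EMOTICON_RE` pattern, the toy `crude_stem` function, the `(1, 2)` n-gram range, the linear kernel, and the six made-up reviews are all assumptions chosen for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumption: a tiny illustrative stop word list, not the one used in the paper.
STOP_WORDS = {"the", "a", "an", "is", "was", "were", "and", "it", "this", "after"}

# Assumption: a simple pattern for western-style emoticons such as :) ;-( :D
EMOTICON_RE = re.compile(r"[:;=][\-^']?[)(\]\[DdPp/\\|]")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Drop emoticons, special characters and stop words, then stem each token."""
    # Emoticons go first so that ":D" does not leave a stray "d" token behind.
    text = EMOTICON_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(crude_stem(t) for t in tokens)


# Made-up reviews for illustration only (1 = positive, 0 = negative).
train_texts = [
    "Great phone, battery lasts long :)",
    "Loved this movie, amazing acting!",
    "Excellent sound quality and works perfectly",
    "Terrible screen, stopped working after a week :(",
    "Boring movie, awful plot",
    "Worst purchase, poor quality and broke quickly",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF weights over unigrams and bigrams feed a linear SVM.
model = make_pipeline(
    TfidfVectorizer(preprocessor=preprocess, ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

predictions = model.predict(
    ["amazing battery, great phone", "awful and boring, terrible screen"]
).tolist()
print(predictions)
```

Passing `preprocess` as the vectorizer's `preprocessor` keeps the cleaning step inside the pipeline, so the same transformation is applied at training and prediction time. A production setup would likely swap `crude_stem` for an established stemmer and tune the n-gram range and SVM regularization on held-out data.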


