Feature extraction for tweet classification: Do the humans perform better?

2017 12th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP). Pub Date: 2017-07-01. DOI: 10.1109/SMAP.2017.8022667

N. Tsapatsoulis, Constantinos Djouvas


Citations: 13

Abstract

Sentiment analysis of Twitter data has become a research trend over the last decade. Thanks to the Twitter API, massive amounts of tweets relating to a topic of interest can be collected in real time. Sentiment analysis of these tweets can be used for social sensing and opinion mining. For instance, election forecasting is a primary area in which sentiment analysis of tweets has been extensively applied in the last few years. Sentiment analysis of Twitter data presents important challenges compared to the similar task of text classification. Tweets are limited to 140 characters; thus, the conveyed message is compressed and often context-dependent. Tweets are informal and unstructured, usually lacking grammatical soundness and a standard lexicon. On the other hand, tweets are usually annotated by their authors regarding topic and sentiment with the aid of hashtags and emoticons. Identifying appropriate features for sentiment analysis of tweets remains an open research area, since text indexing methods face the sparseness problem while POS tagging methods fail due to the lack of grammatical structure in tweets. Character-based features, i.e., character n-grams, are becoming popular because they are language independent. However, their effectiveness remains quite low. In this paper, we argue that the tokens used by humans for sentiment analysis of tweets are probably the best feature set one can use for that purpose. We compare several automatically extracted features with the features (tokens) used by humans for tweet classification, under a machine learning framework. The results show that the manually indicated tokens combined with a Decision Tree classifier outperform any other combination of feature set and classification algorithm. The manually annotated dataset used in our experiments is publicly available.
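
To make the feature types named in the abstract concrete, here is a minimal Python sketch that extracts character n-grams and author-supplied markers (hashtags and emoticons) from a tweet. The function names, the small emoticon pattern, and the example tweet are assumptions made for illustration only. They are not taken from the paper, whose actual feature extraction may differ.

```python
import re
from collections import Counter

# Assumed, deliberately small emoticon pattern; real inventories are much larger.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
HASHTAG_RE = re.compile(r"#\w+")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def author_markers(text):
    """Return the hashtags (lowercased) and emoticons found in a tweet."""
    return {
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(text)],
        "emoticons": EMOTICON_RE.findall(text),
    }


# Hypothetical example tweet, not drawn from the paper's dataset.
tweet = "Great debate tonight :) #Election2016"
trigrams = char_ngrams(tweet, n=3)
markers = author_markers(tweet)
```

Character n-grams need no tokenizer or lexicon, which is why they are language independent. The hashtag and emoticon markers are closer in spirit to the tokens a human reader would point to.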

