Sentence-level sentiment analysis in Persian

2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA). Pub Date: 2017-04-01. DOI: 10.1109/PRIA.2017.7983023

Mohammad Ehsan Basiri, Arman Kabiri


Citations: 35

Abstract

Sentiment analysis (SA) is a subfield of natural language processing and data mining concerned with extracting useful information from users' comments on the Web. Although researchers have studied various SA problems for more than a decade, most studies concentrate on English, and languages such as Persian have not received the attention they deserve. The scarcity of resources for evaluating sentiment analysis methods is the main limiting factor for Persian. This paper addresses that scarcity by introducing two new resources: SPerSent, a sentence-level dataset for sentiment analysis in Persian, and CNRC, a new Persian lexicon. SPerSent contains 150,000 sentences, each associated with two labels: a binary label indicating the polarity of the sentence and a five-star rating. These labels are obtained automatically using a lexicon-based method. Specifically, three lexicons are used independently to label each sentence; majority voting then aggregates the polarity labels, and averaging aggregates the five-star labels. Finally, a well-known machine learning method, Naïve Bayes, is used to evaluate SPerSent.
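
The labeling scheme described in the abstract (three lexicons applied independently, majority voting for polarity, averaging for the five-star rating) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy lexicons (with English placeholder words instead of Persian entries), the per-sentence scoring rule, the tie-breaking of a zero score toward positive, and the linear mapping from scores to stars.

```python
from statistics import mean

# Hypothetical toy lexicons mapping a word to a score in [-1, 1].
# The paper's actual lexicons (including CNRC) are NOT reproduced here.
LEXICON_A = {"good": 0.8, "great": 1.0, "bad": -0.7, "slow": -0.4}
LEXICON_B = {"good": 0.6, "bad": -0.9, "slow": -0.2, "cheap": 0.3}
LEXICON_C = {"great": 0.9, "bad": -0.5, "cheap": -0.3}


def lexicon_score(tokens, lexicon):
    """Mean score of the tokens found in the lexicon; 0.0 if none match."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return mean(hits) if hits else 0.0


def to_polarity(score):
    """Binary polarity: 1 = positive, 0 = negative (assumption: 0.0 counts as positive)."""
    return 1 if score >= 0 else 0


def to_stars(score):
    """Assumed linear mapping from [-1, 1] to a rating in [1, 5]."""
    return 1 + 2 * (score + 1)


def label_sentence(tokens, lexicons):
    """Label one tokenized sentence with (polarity, stars)."""
    scores = [lexicon_score(tokens, lx) for lx in lexicons]
    votes = [to_polarity(s) for s in scores]
    # Majority voting over the per-lexicon polarity votes.
    polarity = 1 if 2 * sum(votes) > len(votes) else 0
    # Average the per-lexicon star ratings, then round to a whole star.
    stars = round(mean([to_stars(s) for s in scores]))
    return polarity, stars


LEXICONS = [LEXICON_A, LEXICON_B, LEXICON_C]
```

Calling `label_sentence(["bad", "slow"], LEXICONS)` returns a `(polarity, stars)` pair for one tokenized sentence. Using an odd number of lexicons, as in the abstract, means the binary vote can never tie.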
