Sentence-level sentiment analysis in Persian
Mohammad Ehsan Basiri, Arman Kabiri
2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA), 2017-04-01
DOI: https://doi.org/10.1109/PRIA.2017.7983023
Citations: 35
Abstract
Sentiment analysis (SA) is a subfield of natural language processing and data mining concerned with extracting useful information from users' comments on the Web. Although researchers have studied various problems in SA for more than a decade, most studies concentrate on English, and languages such as Persian have not received the attention they deserve. The scarcity of resources for evaluating sentiment analysis methods is the main limiting factor for Persian. This paper addresses that scarcity by introducing two new resources: SPerSent, a sentence-level dataset for sentiment analysis in Persian, and CNRC, a new Persian lexicon. SPerSent contains 150,000 sentences, each associated with two labels: a binary label indicating the polarity of the sentence, and a five-star rating. These labels are obtained automatically using a lexicon-based method. Specifically, three lexicons are used independently to label each sentence. Majority voting is then used to aggregate the polarity labels, and averaging is used to aggregate the five-star labels. Finally, a well-known machine learning method, Naïve Bayes, is used to evaluate SPerSent.
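
The labeling procedure described in the abstract (three lexicons applied independently, majority voting for polarity, averaging for the star rating) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the toy lexicons `LEXICONS` with English placeholder words and scores in [-1, 1], the per-sentence scoring rule in `lexicon_score`, the linear score-to-stars mapping in `score_to_stars`, and the choice to treat a non-positive score as a negative vote.

```python
from statistics import mean

# Hypothetical toy lexicons (word -> polarity score in [-1, 1]).
# English placeholders stand in for Persian entries; these are NOT
# the lexicons used in the paper.
LEXICONS = [
    {"good": 0.8, "bad": -0.7, "great": 0.9},
    {"good": 0.6, "bad": -0.9, "slow": -0.4},
    {"good": 0.5, "great": 0.7, "slow": -0.2},
]


def lexicon_score(tokens, lexicon):
    """Mean score of the tokens found in the lexicon; 0.0 if none match."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return mean(hits) if hits else 0.0


def score_to_stars(score):
    """Assumed linear mapping from a score in [-1, 1] to 1..5 stars."""
    return 1 + round((score + 1) * 2)


def label_sentence(sentence):
    """Return (polarity, stars) aggregated over all lexicons."""
    tokens = sentence.lower().split()
    scores = [lexicon_score(tokens, lex) for lex in LEXICONS]

    # Majority voting: each lexicon votes positive if its score is > 0.
    positive_votes = sum(1 for s in scores if s > 0)
    polarity = 1 if positive_votes > len(scores) / 2 else 0

    # Averaging: mean of the per-lexicon star ratings, rounded to an integer.
    stars = round(mean(score_to_stars(s) for s in scores))
    return polarity, stars


if __name__ == "__main__":
    for text in ["good", "bad and slow"]:
        print(text, label_sentence(text))
```

With an odd number of lexicons and a binary vote per lexicon, the majority is always well defined. In this sketch a sentence with no lexicon matches falls into the negative class, which a real pipeline might instead discard or handle separately.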