Some methods to address the problem of unbalanced sentiment classification in an Arabic context

A. Mountassir, H. Benbrahim, I. Berrada
Published in: 2012 Colloquium in Information Science and Technology, 2012-12-24. DOI: 10.1109/CIST.2012.6388061 (https://doi.org/10.1109/CIST.2012.6388061)
Citation count: 13

Abstract

The rise of social media (such as online web forums and social networking sites) has attracted interest in mining and analyzing opinions available on the web. Online opinion has become an object of study in many research areas, especially the one known as “Opinion Mining and Sentiment Analysis”. Several interesting and advanced works have been carried out on a few languages (in particular English). However, there have been very few studies on some languages such as Arabic. This paper presents the study we have carried out to address the problem of unbalanced data sets in supervised sentiment classification in an Arabic context. We propose three different methods to under-sample the majority class documents. Our goal is to compare the effectiveness of the proposed methods with that of the common random under-sampling. We also aim to evaluate the behavior of the classifier under different under-sampling rates. We use two common classifiers, namely Naïve Bayes and Support Vector Machines. The experiments are carried out on an Arabic data set that we built from Aljazeera's web site and labeled manually. The results show that Naïve Bayes is sensitive to data set size: the more we reduce the data, the more the results degrade. However, it is not sensitive to unbalanced data sets, in contrast to Support Vector Machines, which are highly sensitive to unbalanced data sets. The results also show that we can rely on the proposed techniques and that they are typically competitive with random under-sampling.
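The abstract does not describe the three proposed under-sampling methods, so only the random under-sampling baseline it names can be illustrated. The following is a minimal sketch of that baseline, not the authors' implementation. The function name `random_under_sample`, the toy data, and the definition of `rate` as the fraction of majority-class documents removed are all assumptions made for illustration; the paper may define the under-sampling rate differently.

```python
import random
from collections import Counter


def random_under_sample(docs, labels, rate, seed=0):
    """Randomly drop a fraction `rate` of the majority-class documents.

    Assumption: `rate` in [0, 1] is the share of majority-class documents
    removed. This definition is illustrative, not taken from the paper.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")

    # The majority class is the label with the most documents.
    majority = Counter(labels).most_common(1)[0][0]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]

    # Choose which majority documents survive; seeded for repeatability.
    n_keep = round(len(majority_idx) * (1.0 - rate))
    rng = random.Random(seed)
    kept_majority = set(rng.sample(majority_idx, n_keep))

    # Keep every minority document and the sampled majority documents,
    # preserving the original document order.
    keep = [
        i for i, y in enumerate(labels)
        if y != majority or i in kept_majority
    ]
    return [docs[i] for i in keep], [labels[i] for i in keep]


# Toy data (hypothetical): 8 "neg" documents against 2 "pos" documents.
docs = [f"doc{i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

sub_docs, sub_labels = random_under_sample(docs, labels, rate=0.75)
print(Counter(sub_labels))
```

Sweeping `rate` from 0 up to the value that balances the two classes, and retraining a classifier at each step, would be one way to reproduce the kind of comparison across under-sampling rates that the abstract mentions.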