Bias aware lexicon-based Sentiment Analysis of Malay dialect on social media data: A study on the Sabah Language

M. Hijazi, Lyndia Libin, R. Alfred, Frans Coenen
Published in: 2016 2nd International Conference on Science in Information Technology (ICSITech), 2016-10-01
DOI: 10.1109/ICSITECH.2016.7852662 (https://doi.org/10.1109/ICSITECH.2016.7852662)
Citations: 8

Abstract

Sentiment Analysis (SA) has gained popularity over the years because of the benefits it brings to the development of the economy, sociology and politics. SA enables the observation, experimental study and quantification of public emotion toward a particular issue. However, little SA work has been done on the Malay language, especially on the Malay dialects used in social media. The research presented in this paper aims to perform SA on one derivative of the Malay language, namely the Sabah Language. The Sabah Language, unlike many other languages, has no fixed spelling and, when used in an unstructured form as on social media, poses particular difficulties for SA. This paper takes a lexicon-based approach to SA of the Sabah Language as used on social media. The corpus selected for the investigation consisted of Facebook posts and tweets written in the Sabah Language, 443 posts and tweets in total. Each was manually annotated as positive, negative or neutral by three annotators. Because the Sabah Language is a derivative of Malay, it shares most of its vocabulary with Malay. For this reason, in the Sentiment Lexicon (SL) construction process, an opinion-bearing Malay SL was retrieved, modified and expanded to build the Sabah SL. Three methods of assigning scores to the opinion-bearing words in the SL were employed during SL construction: (i) Simple PSA, (ii) Simple PSA with Switch Negation (PSA-SN) and (iii) Strength-based PSA. A pre-processing phase comprising a spellchecker and a short-form corrector was also implemented to reduce the number of distinct words to be analyzed. In the classification phase, two methods, simple classification and bias-aware classification, were used to classify the posts. Experiments were conducted to show the effect of SL modification and expansion, of pre-processing, and of bias-aware classification on SA performance. The results show that the highest accuracy, 85.10%, was achieved using bias-aware classification with the modified and expanded SL, scores assigned using Simple PSA, and pre-processed text.
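The pipeline summarized above (short-form correction, lexicon scoring with switch negation, then classification) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the lexicon entries, the short-form table, the negation words, and the function names `preprocess`, `score` and `classify` are hypothetical and not taken from the paper. In particular, the abstract does not define how bias-aware classification works; the `bias` offset below is only one plausible reading, in which the decision threshold is shifted away from zero to compensate for a lexicon that leans positive or negative.

```python
import re

# Hypothetical toy lexicon: word -> polarity score. These entries are
# illustrative placeholders, not the paper's Sabah sentiment lexicon.
LEXICON = {"bagus": 1, "cantik": 1, "teruk": -1, "malas": -1}

# Hypothetical short-form table standing in for the short-form corrector.
SHORT_FORMS = {"bgs": "bagus", "x": "tidak", "tdk": "tidak"}

# Hypothetical negation words; each flips the next opinion-bearing word.
NEGATORS = {"tidak", "bukan"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and expand known short forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [SHORT_FORMS.get(tok, tok) for tok in tokens]


def score(tokens):
    """Sum lexicon scores, switching the sign of the first opinion word
    that follows a negator (a simple form of switch negation)."""
    total = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(tok, 0)
        if value != 0:
            total += -value if negate else value
            negate = False
    return total


def classify(total, bias=0.0):
    """Label a post by comparing its score with a bias-shifted threshold.

    With bias=0.0 this is a plain sign test; a non-zero bias is an assumed
    stand-in for bias-aware classification."""
    if total > bias:
        return "positive"
    if total < bias:
        return "negative"
    return "neutral"
```

For example, `classify(score(preprocess("x bgs")))` expands the short forms to `tidak bagus`, flips the positive score of `bagus`, and labels the post negative. A real system would also need the spellchecker the abstract mentions, which this sketch omits.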