{"title":"A set of parameters for automatically annotating a Sentiment Arabic Corpus","authors":"Guellil Imane, Darwish Kareem, Azouaou Faical","doi":"10.1108/IJWIS-03-2019-0008","DOIUrl":null,"url":null,"abstract":"This paper aims to propose an approach to automatically annotate a large corpus in Arabic dialect. This corpus is used in order to analyse sentiments of Arabic users on social medias. It focuses on the Algerian dialect, which is a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million speakers, few studies address the automated processing in general and the sentiment analysis in specific for Algerian.,The approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text that is extracted from Facebook. Using this approach allow to significantly increase the size of the training corpus without calling the manual annotation. The annotated corpus is then vectorized using document embedding (doc2vec), which is an extension of word embeddings (word2vec). For sentiments classification, the authors used different classifiers such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).,The results suggest that NB and SVM classifiers generally led to the best results and MLP generally had the worst results. Further, the threshold that the authors use in selecting messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW led to slightly higher results than using PV-DM. Combining PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier with F1 up to 86.9 per cent.,The principal originality of this paper is to determine the right parameters for automatically annotating an Algerian dialect corpus. This annotation is based on a sentiment lexicon that was also constructed automatically.","PeriodicalId":44153,"journal":{"name":"International Journal of Web Information Systems","volume":"15 1","pages":"594-615"},"PeriodicalIF":2.5000,"publicationDate":"2019-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Web Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1108/IJWIS-03-2019-0008","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 1
Abstract
Purpose – This paper proposes an approach to automatically annotate a large corpus in Arabic dialect, which is then used to analyse the sentiments of Arabic users on social media. It focuses on the Algerian dialect, a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million speakers, few studies address automated processing in general, and sentiment analysis in particular, for Algerian.

Design/methodology/approach – The approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text extracted from Facebook. This approach significantly increases the size of the training corpus without requiring manual annotation. The annotated corpus is then vectorized using document embeddings (doc2vec), an extension of word embeddings (word2vec). For sentiment classification, the authors used several classifiers, including support vector machines (SVM), naive Bayes (NB), logistic regression (LR) and a multilayer perceptron (MLP).

Findings – The results suggest that the NB and SVM classifiers generally led to the best results, while MLP generally had the worst. Further, the threshold used to select messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW led to slightly better results than using PV-DM, and combining the PV-DBOW and PV-DM representations led to slightly worse results than using PV-DBOW alone. The best results were obtained by the NB classifier, with an F1 score of up to 86.9 per cent.

Originality/value – The principal originality of this paper is to determine the right parameters for automatically annotating an Algerian dialect corpus. This annotation is based on a sentiment lexicon that was also constructed automatically.
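To make the described pipeline concrete, the following is a minimal sketch (not the authors' actual code) of the three stages in the abstract: lexicon-based automatic annotation with a selection threshold, doc2vec (PV-DBOW) vectorization, and classification with NB, SVM and LR. It assumes the gensim and scikit-learn libraries; the lexicon format, the scoring rule, the 0.6 threshold and all identifiers are illustrative assumptions.

```python
# Hedged sketch of the pipeline described in the abstract. The lexicon format,
# scoring rule and threshold are assumptions for illustration, not the paper's
# exact implementation.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


def lexicon_score(tokens, lexicon):
    """Average polarity of the tokens found in the sentiment lexicon (assumed in [-1, 1])."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def auto_annotate(messages, lexicon, threshold=0.6):
    """Keep only messages whose absolute lexicon score exceeds the threshold; label by sign."""
    corpus, labels = [], []
    for tokens in messages:
        score = lexicon_score(tokens, lexicon)
        if abs(score) >= threshold:
            corpus.append(tokens)
            labels.append(1 if score > 0 else 0)
    return corpus, labels


def train_pipeline(messages, lexicon):
    """messages: list of token lists (e.g. Facebook posts); lexicon: {word: polarity}."""
    corpus, labels = auto_annotate(messages, lexicon)
    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(corpus)]
    # dm=0 selects PV-DBOW; PV-DM would be dm=1
    d2v = Doc2Vec(tagged, vector_size=100, dm=0, min_count=2, epochs=20)
    X = [d2v.dv[i] for i in range(len(corpus))]
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    for name, clf in [("NB", GaussianNB()),
                      ("SVM", LinearSVC()),
                      ("LR", LogisticRegression(max_iter=1000))]:
        clf.fit(X_tr, y_tr)
        print(name, f1_score(y_te, clf.predict(X_te)))
```

GaussianNB is used here because doc2vec vectors contain negative components, which a multinomial naive Bayes model would reject; whether the paper used this exact NB variant is not stated in the abstract.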
Journal description:
The Global Information Infrastructure is a daily reality. In spite of the many applications in all domains of our societies, for instance e-business, e-commerce, e-learning, e-science and e-government, and in spite of the tremendous advances by engineers and scientists, the seamless development of Web information systems and services remains a major challenge. The journal examines how the current shared vision for the future is one of semantically rich information and service-oriented architectures for global information systems. This vision sits at the convergence of progress in technologies such as XML, Web services, RDF and OWL; in multimedia, multimodal and multilingual information retrieval; and in distributed, mobile and ubiquitous computing. Topicality: while the International Journal of Web Information Systems covers a broad range of topics, the journal welcomes papers that provide a perspective on all aspects of Web information systems: Web semantics and Web dynamics, Web mining and searching, Web databases and Web data integration, Web-based commerce and e-business, Web collaboration and distributed computing, Internet computing and networks, performance of Web applications, and Web multimedia services and Web-based education.