A combined AraBERT and Voting Ensemble classifier model for Arabic sentiment analysis

Natural Language Processing Journal, Volume 8, Article 100100. Pub Date: 2024-09-01. DOI: 10.1016/j.nlp.2024.100100

Dhaou Ghoul, Jérémy Patrix, Gaël Lejeune, Jérôme Verny


Citations: 0

Abstract

For sentiment analysis of short texts (e.g., movie reviews and tweets), one approach is to build machine learning models that determine their tone (positive, negative, or neutral). However, such natural language processing (NLP) studies are scarce for languages such as Arabic, which lack large-scale, high-quality training data. In this paper, we present three machine learning models designed to classify the sentiment of Arabic tweets, developed for a Kaggle competition. The first is a Voting Ensemble classifier that takes advantage of both character-level and word-level features. The second is an AraBERT (Arabic Bidirectional Encoder Representations from Transformers) model with preprocessing by the Farasa Segmenter. The third combines the first two approaches: a Voting Ensemble classifier using AraBERT embeddings. According to the reported performance measures, all three models improve on previous efforts. The third model exhibits strong performance, with an F-score of 73.98%. The work presented here could be useful for future studies and for new Arabic sentiment analysis online services or competitions.
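As a rough illustration of two ideas named in the abstract (character-level versus word-level features, and a voting ensemble), the sketch below uses only the Python standard library. It is not the authors' implementation. The function names, the n-gram size, and the choice of hard (majority) voting are assumptions made for this example, since the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Character-level features: every contiguous n-gram in the text.

    The value n=3 is an assumption for illustration only.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_tokens(text):
    """Word-level features: whitespace-separated tokens (a simplification)."""
    return text.split()


def hard_vote(predictions):
    """Return the label chosen by the most base classifiers.

    Ties go to the label that appears first in `predictions`, because
    Counter.most_common keeps first-seen order among equal counts.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical labels from three base classifiers for one tweet.
votes = ["positive", "negative", "positive"]
print(hard_vote(votes))
```

In a full pipeline, each base classifier would be trained on one or both feature types, and `hard_vote` would merge their per-tweet predictions. A soft-voting variant would average predicted class probabilities instead of counting labels.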



