A combined AraBERT and Voting Ensemble classifier model for Arabic sentiment analysis
Dhaou Ghoul, Jérémy Patrix, Gaël Lejeune, Jérôme Verny
Natural Language Processing Journal, volume 8, Article 100100. Published 2024-09-01.
DOI: 10.1016/j.nlp.2024.100100
URL: https://www.sciencedirect.com/science/article/pii/S2949719124000487
Citations: 0
Abstract
For sentiment analysis of short texts (e.g. movie reviews or tweets), one approach is to build machine learning models that determine their tone (positive, negative, or neutral). However, such natural language processing (NLP) studies are scarce for languages such as Arabic, for which high-quality, large-scale training data is lacking. In this paper, we present three machine learning models, developed for a Kaggle competition, that classify the sentiment of Arabic tweets. The first is a Voting Ensemble classifier that takes advantage of both character-level and word-level features. The second is an AraBERT (Arabic Bidirectional Encoder Representations from Transformers) model with preprocessing by the Farasa Segmenter. The third combines the first two approaches: a Voting Ensemble classifier using AraBERT embeddings. All three models improve on previous efforts. The third model performs strongly, with an F-score of 73.98%. The work presented here could be useful for future studies and for new Arabic sentiment analysis online services or competitions.
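The idea of a voting ensemble over character-level and word-level features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the base classifiers, so a small multinomial Naive Bayes stands in for them here, the three feature views (words, character bigrams, character trigrams) are an assumption, and the four training sentences are invented toy data. All names (`NaiveBayes`, `hard_vote`, `ensemble_predict`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Character n-grams of the text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_tokens(text):
    """Whitespace-separated word tokens."""
    return text.split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over one feature view.

    Stand-in base classifier (assumption); the paper may use other models.
    """

    def __init__(self, featurize):
        self.featurize = featurize
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = self.featurize(text)
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.docs):
            total = sum(self.counts[label].values())
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(self.docs[label] / total_docs)
            for feat in feats:
                smoothed = self.counts[label][feat] + 1
                score += math.log(smoothed / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def hard_vote(predictions):
    """Return the label predicted by the most voters (majority vote)."""
    return Counter(predictions).most_common(1)[0][0]


# Invented toy data, for illustration only.
train_texts = ["فيلم رائع جدا", "خدمة ممتازة", "فيلم سيء جدا", "خدمة سيئة"]
train_labels = ["positive", "positive", "negative", "negative"]

# One voter per feature view: words, character bigrams, character trigrams.
voters = [
    NaiveBayes(word_tokens).fit(train_texts, train_labels),
    NaiveBayes(lambda t: char_ngrams(t, 2)).fit(train_texts, train_labels),
    NaiveBayes(lambda t: char_ngrams(t, 3)).fit(train_texts, train_labels),
]


def ensemble_predict(text):
    """Collect each voter's label and combine them by majority vote."""
    return hard_vote([voter.predict(text) for voter in voters])


print(ensemble_predict("فيلم رائع"))
```

An odd number of voters is used so that a two-class vote cannot tie. In the third approach described above, the hand-built feature views would presumably be replaced or complemented by AraBERT embeddings; that step is omitted here because it needs a pretrained model.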