Benchmark Arabic news posts and analyzes Arabic sentiment through RMuBERT and SSL with AMCFFL technique

IF 4.3 | Zone 3, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Egyptian Informatics Journal, Volume 29, Article 100601. Pub Date: 2025-01-15. DOI: 10.1016/j.eij.2024.100601

Mustafa Mhamed, Richard Sutcliffe, Jun Feng


Citations: 0


Benchmark Arabic news posts and analyzes Arabic sentiment through RMuBERT and SSL with AMCFFL technique

Sentiment analysis aims to extract emotions from textual data; together with text recognition, it is one of the most common tasks in natural language processing. Emerging technologies have been developed and employed in various fields, including marketing, health care, and policy making. However, with the growth of social media platforms and data flows, especially in Arabic, substantial difficulties have emerged that call for new frameworks. These include the lack of datasets from news platforms, the complex structure of the Arabic language, classification difficulties, and system-level challenges in machine learning, deep learning, and online analysis tools. This paper provides a new framework that helps address Arabic sentiment analysis (ASA) challenges and supports various tasks based on the state of the art in ASA. First, it presents a new collection, ANP5, gathered from Arabic news posts on several Arabic platforms, then uses SSL with the AMCFFL technique to analyze the Arabic sentiment and generate a second dataset, ANPS2. Next, ML classifiers are applied; RF and SVM perform best among them, with an accuracy of 82.00%, although the metric distributions differ for each class (Experiment 1). Following that, the DL models BiGRU, CNN-LSTM, LSTM, and CNN achieve accuracies of 88.10%, 89.30%, 89.85%, and 90.10%, respectively (Experiment 2). Experiments 1 and 2 provide the initial benchmark classification and serve as the first baseline. Afterward, a new model, RMuBERT, is developed and compared with four transformers on the two datasets, reaching accuracies of 90.87% on ANPS2 and 90.33% on ANP5 and outperforming the baselines (Experiment 3). Further testing of RMuBERT on Arabic corpora with different classes, lengths, and sizes, namely ArSarcasm (3C), STD (2C), AJGT (2C), and AAQ (2C), yields accuracies of 77.76%, 91.79%, 94.07%, and 93.48%, respectively; RMuBERT again performs better than the baselines (Experiment 4). Finally, on the largest Arabic sentiment corpus, with six million Arabic tweets, performance reaches up to 91.12%, and RMuBERT works efficiently with less training time (Experiment 5).
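The abstract's remark for Experiment 1, that classifiers with the same accuracy can still show different per-class measurements, can be illustrated with a small worked example. The sketch below is not the authors' evaluation code: the labels are invented toy data (not drawn from ANP5 or ANPS2), and `per_class_report`, `model_a`, and `model_b` are hypothetical names used only for illustration.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return overall accuracy and a {label: (precision, recall)} mapping."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    true_counts = Counter(y_true)   # how often each label really occurs
    pred_counts = Counter(y_pred)   # how often each label is predicted
    hits = Counter(t for t, p in pairs if t == p)  # correct predictions per label

    report = {}
    for label in labels:
        precision = hits[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = hits[label] / true_counts[label] if true_counts[label] else 0.0
        report[label] = (precision, recall)
    return accuracy, report


# Toy labels, invented for illustration only (not from the paper's datasets).
y_true = ["pos"] * 6 + ["neg"] * 4
# Hypothetical model A: every error falls on the negative class.
model_a = ["pos"] * 6 + ["neg", "neg", "pos", "pos"]
# Hypothetical model B: one error in each class.
model_b = ["pos"] * 5 + ["neg"] + ["neg"] * 3 + ["pos"]

acc_a, rep_a = per_class_report(y_true, model_a)
acc_b, rep_b = per_class_report(y_true, model_b)

for name, acc, rep in (("A", acc_a, rep_a), ("B", acc_b, rep_b)):
    print(f"model {name}: accuracy={acc:.2f}")
    for label, (precision, recall) in rep.items():
        print(f"  {label}: precision={precision:.2f} recall={recall:.2f}")
```

Both toy models reach the same accuracy, yet model A recovers only half of the negative posts while model B recovers three quarters of them. This is why per-class precision and recall are worth reporting alongside a single accuracy figure.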


Source journal

Egyptian Informatics Journal (Decision Sciences: Management Science and Operations Research)

CiteScore: 11.10

Self-citation rate: 1.90%

Publication count: not listed

Review time: 110 days

About the journal: The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The journal provides a forum for state-of-the-art research and development in the fields of computing, including computer sciences, information technologies, information systems, operations research, and decision support. Submissions of innovative, previously unpublished work in subjects covered by the journal are encouraged, whether from academic, research, or commercial sources.