Arabic Sentiment Analysis Based on Word Embeddings and Deep Learning

Comput. | Pub Date: 2023-06-19 | DOI: 10.3390/computers12060126

Nasrin Elhassan, G. Varone, Rami Ahmed, M. Gogate, K. Dashtipour, Hani Almoamari, M. El-Affendi, B. Al-Tamimi, Faisal Albalwy, Amir Hussain


Citations: 1

Abstract


Social media networks have grown exponentially over the last two decades, giving internet users the opportunity to communicate and exchange ideas on a variety of topics. As a result, opinion mining plays a crucial role in analyzing user opinions and applying them to guide choices, making it one of the most popular research areas in natural language processing. Although several languages, including English, have been studied extensively, comparatively little work has addressed Arabic. The morphological complexity and many dialects of the language make semantic analysis particularly challenging. Moreover, the lack of accurate pre-processing tools and the limited resources available are constraining factors. This study was motivated by the success of deep learning algorithms and word embeddings in English sentiment analysis. Extensive supervised machine learning experiments were conducted in which word embeddings were used to determine the sentiment of Arabic reviews. Three deep learning models were applied: a convolutional neural network (CNN), a long short-term memory (LSTM) network, and a hybrid CNN-LSTM. The models used features learned by word embeddings such as Word2Vec and fastText rather than hand-crafted features. They were tested on two benchmark Arabic datasets, the Hotel Arabic Reviews Dataset (HARD) for hotel reviews and Large-Scale Arabic Book Reviews (LABR) for book reviews, each in several setups. Comparative experiments combined the three models with the two word embeddings and the different dataset setups. The main novelty of this study is its exploration of how effective the different word embeddings are across benchmark dataset setups that vary in class balance (balanced or imbalanced) and in task (binary or multi-class classification).
The findings showed that, in most cases, the best results for all three proposed models (CNN, LSTM, and CNN-LSTM) were obtained with the fastText word embedding on the HARD 2-imbalance dataset. Further, on the benchmark HARD dataset with fastText, the proposed CNN model outperformed the LSTM and CNN-LSTM models: the three models achieved 94.69%, 94.63%, and 94.54% accuracy, respectively. Although the worst results were obtained on the LABR 3-imbalance dataset with both Word2Vec and fastText, they still outperformed the state-of-the-art results reported by other researchers on the same dataset.
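
The dataset setups named in the abstract (for example "2-imbalance" and "3-imbalance") can be illustrated with a minimal sketch. The abstract does not say how the setups were built, so everything here is an assumption made for illustration: reviews carry 1-5 star ratings, ratings above 3 count as positive and below 3 as negative, a rating of 3 is dropped in the binary setup and kept as neutral in the three-class setup, and balancing is done by downsampling every class to the size of the smallest one. The names `to_label`, `build_setup`, and `toy_reviews` are hypothetical.

```python
import random
from collections import Counter


def to_label(rating, num_classes):
    """Map a 1-5 star rating to a sentiment label (assumed mapping, not taken from the paper)."""
    if num_classes not in (2, 3):
        raise ValueError("num_classes must be 2 or 3")
    if rating == 3:
        # Assumption: neutral reviews are dropped for binary classification.
        return "neutral" if num_classes == 3 else None
    return "positive" if rating > 3 else "negative"


def build_setup(reviews, num_classes, balanced, seed=0):
    """Return (text, label) pairs for one dataset setup.

    reviews: list of (text, rating) pairs.
    balanced: if True, downsample every class to the size of the smallest class.
    """
    labelled = []
    for text, rating in reviews:
        label = to_label(rating, num_classes)
        if label is not None:
            labelled.append((text, label))
    if not balanced:
        return labelled

    # Group by class, then draw the same number of examples from each class.
    by_class = {}
    for item in labelled:
        by_class.setdefault(item[1], []).append(item)
    smallest = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.sample(by_class[label], smallest))
    rng.shuffle(sample)
    return sample


# Toy data invented for illustration: review ids with star ratings.
ratings = [5, 5, 5, 4, 4, 3, 2, 1, 5, 4]
toy_reviews = [("review-%d" % i, r) for i, r in enumerate(ratings)]

two_imbalanced = build_setup(toy_reviews, num_classes=2, balanced=False)
two_balanced = build_setup(toy_reviews, num_classes=2, balanced=True)
three_imbalanced = build_setup(toy_reviews, num_classes=3, balanced=False)

print(Counter(label for _, label in two_imbalanced))
print(Counter(label for _, label in two_balanced))
print(Counter(label for _, label in three_imbalanced))
```

Under these assumptions, the imbalanced setups keep the natural skew toward positive reviews, while the balanced setup trades data volume for equal class sizes, which is the kind of contrast the comparative experiments vary.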
