Investigating the relevance of Arabic text classification datasets based on supervised learning

Q1 Engineering

Journal of Electronic Science and Technology Pub Date : 2022-06-01 DOI:10.1016/j.jnlest.2022.100160

Ahmad Hussein Ababneh

Volume 20, Issue 2, Article 100160. Full text: https://www.sciencedirect.com/science/article/pii/S1674862X22000131

Citations: 2

Abstract

Training and testing text classification models depends mainly on pre-classified document datasets. Seven datasets have recently emerged for Arabic text classification: the Single-Label Arabic News Articles Dataset (SANAD), Khaleej, Arabiya, Akhbarona, KALIMAT, Waten2004, and Khaleej2004. This study investigates which of these datasets provide significant training and fair evaluation for text classification. The investigation uses five well-known and accurate learning models: naive Bayes, random forest, K-nearest neighbor, support vector machines, and logistic regression. We present relevance and time measures for training the models on these datasets, giving Arabic language researchers a solid basis of comparison for selecting an appropriate dataset. The performance of the five models across the seven datasets is measured and compared with the performance of the same models trained on a well-known English language dataset. The analysis of the relevance and time scores shows that training the support vector machine model on Khaleej and Arabiya obtained the most significant results in the shortest amount of time, with an accuracy of 82%.
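The comparison described in the abstract, training a model on each dataset and recording a quality score alongside the training time, can be sketched as follows. This is a minimal illustration only, not the paper's method: the abstract does not define its relevance measure, so plain accuracy stands in for it; only one of the five listed models (naive Bayes) is shown, as a small hand-written multinomial variant; and the `toy_datasets` corpora, the `MultinomialNB` class, and the `evaluate` helper are hypothetical names and placeholder data, not the Arabic datasets named above.

```python
import math
import time
from collections import Counter, defaultdict


class MultinomialNB:
    """Tiny multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, docs):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        predictions = []
        for tokens in docs:
            best_label, best_score = None, -math.inf
            for label, n_docs in self.class_counts.items():
                counts = self.word_counts[label]
                denom = sum(counts.values()) + vocab_size
                score = math.log(n_docs / total_docs)  # log prior
                for w in tokens:
                    if w in self.vocab:  # ignore words unseen in training
                        score += math.log((counts[w] + 1) / denom)
                if score > best_score:
                    best_label, best_score = label, score
            predictions.append(best_label)
        return predictions


def evaluate(train, test):
    """Return (accuracy, training time in seconds) for one dataset split."""
    train_docs, train_labels = zip(*train)
    test_docs, test_labels = zip(*test)
    start = time.perf_counter()
    model = MultinomialNB().fit(train_docs, train_labels)
    train_seconds = time.perf_counter() - start
    predicted = model.predict(test_docs)
    correct = sum(p == y for p, y in zip(predicted, test_labels))
    return correct / len(test_labels), train_seconds


# Hypothetical toy corpora (tokenized documents with labels) standing in
# for two datasets; each entry is (training split, test split).
toy_datasets = {
    "toy_A": (
        [("match goal team".split(), "sports"),
         ("player goal league".split(), "sports"),
         ("bank market stocks".split(), "finance"),
         ("market prices bank".split(), "finance")],
        [("team league goal".split(), "sports"),
         ("stocks prices market".split(), "finance")],
    ),
    "toy_B": (
        [("doctor hospital patient".split(), "health"),
         ("patient medicine doctor".split(), "health"),
         ("software computer network".split(), "tech"),
         ("network phone software".split(), "tech")],
        [("hospital medicine patient".split(), "health"),
         ("computer phone network".split(), "tech")],
    ),
}

results = {name: evaluate(train, test)
           for name, (train, test) in toy_datasets.items()}
for name, (accuracy, seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}, train_time={seconds:.6f}s")
```

Reporting the score and the training time as a pair per dataset mirrors the two axes the abstract compares; a fuller study along these lines would presumably repeat the loop for each model and use held-out splits of the real corpora.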


Source journal

Journal of Electronic Science and Technology (Engineering: Electrical and Electronic Engineering)

CiteScore: 4.30
Self-citation rate: 0.00%
Articles published: 1362
Review time: 99 days

About the journal: JEST (International) covers state-of-the-art achievements in electronic science and technology, with a focus on the following areas: Communication Technology; Computer Science and Information Technology; Information and Network Security; Bioelectronics and Biomedicine; Neural Networks and Intelligent Systems; Electronic Systems and Array Processing; Optoelectronic and Photonic Technologies; Electronic Materials and Devices; Sensing and Measurement; Signal Processing and Image Processing. JEST (International) is dedicated to building an open, high-level academic journal supported by researchers, professionals, and academicians. The journal is fully indexed by Ei INSPEC and has published contributions from more than 20 countries and regions.