An application of textual document classification for Arabic governmental correspondence
Khaled Alzamel, Manayer Alajmi
Kuwait Journal of Science, 52 (1), Article 100299. Published 2024-07-22.
DOI: 10.1016/j.kjs.2024.100299
https://www.sciencedirect.com/science/article/pii/S230741082400124X
Citation count: 0
Abstract
Demand for automated classification of Arabic documents is growing, especially as the volume of linguistic data continues to increase. Natural language processing (NLP) has recently become one of the most significant fields in artificial intelligence (AI), thanks to advances in transformer-based models. Transformers support model reuse through pre-trained models (PTMs). This study fine-tunes monolingual (AraBERT (Antoun et al., 2020)), bilingual (GigaBERT (Lan et al., 2020)), and multilingual (XLM-RoBERTa (Conneau et al., 2020)) transformer-based encoder models to classify official Arabic correspondence into predefined classes, and compares their predictive performance in terms of accuracy on a new balanced dataset. The dataset contains 22,741 Arabic texts in six categories, each labeled with the name of one of the most common ministries. GigaBERT achieved the highest accuracy, 98%. The implemented models may contribute to the domain of information systems (ISs) by facilitating classification in ministries without human intervention.
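The comparison described in the abstract (several classifiers scored by accuracy on a balanced six-class dataset) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the label names, the placeholder models `model_a` to `model_c`, the toy predictions, and the balance tolerance are hypothetical and are not taken from the paper. Because 22,741 is not divisible by six, balance can only hold approximately, so the check below allows a small tolerance.

```python
from collections import Counter

# Hypothetical class labels: the abstract states there are six ministry
# categories but does not name them.
LABELS = ["ministry_a", "ministry_b", "ministry_c",
          "ministry_d", "ministry_e", "ministry_f"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def is_balanced(labels, tolerance=0.05):
    """True if every class count is within `tolerance` (relative) of the mean.

    The tolerance value is an assumption chosen for this sketch.
    """
    counts = Counter(labels)
    mean = len(labels) / len(counts)
    return all(abs(c - mean) <= tolerance * mean for c in counts.values())


def rank_models(y_true, predictions):
    """Return (model_name, accuracy) pairs sorted from best to worst."""
    scores = {name: accuracy(y_true, preds)
              for name, preds in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def with_errors(y_true, n_errors):
    """Copy y_true and change the first n_errors labels (toy predictions)."""
    preds = list(y_true)
    for i in range(n_errors):
        wrong_index = (LABELS.index(preds[i]) + 1) % len(LABELS)
        preds[i] = LABELS[wrong_index]
    return preds


# Toy data: 12 examples, two per class. These are not the paper's results.
y_true = LABELS * 2
toy_predictions = {
    "model_a": with_errors(y_true, 1),
    "model_b": with_errors(y_true, 3),
    "model_c": with_errors(y_true, 2),
}
ranking = rank_models(y_true, toy_predictions)

if __name__ == "__main__":
    print("balanced:", is_balanced(y_true))
    for name, score in ranking:
        print(f"{name}: {score:.3f}")
```

Accuracy is a reasonable single summary here only because the classes are roughly equal in size; on a skewed dataset, per-class metrics would be needed alongside it.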
About the journal:
Kuwait Journal of Science (KJS) is indexed and abstracted by major publishing houses such as Chemical Abstract, Science Citation Index, Current Contents, Mathematics Abstract, and Microbiological Abstracts. KJS publishes peer-reviewed articles in various fields of science, including Mathematics, Computer Science, Physics, Statistics, Biology, Chemistry, and Earth & Environmental Sciences. It also aims to bring the results of scientific research carried out under a variety of intellectual traditions and organizations to the attention of a specialized scholarly readership. Accordingly, the publisher expects submissions of original manuscripts that contain analyses of, and solutions to, important theoretical, empirical, and normative issues.