Title: A Novel Approach for Mitigating Class Imbalance in Arabic Text Classification
Authors: Emad Nabil; Abdelrahman Ezzeldin Nagib; Mena Hany; Safiullah Faizullah; Wael Hassan Gomaa
Journal: IEEE Access, vol. 13, pp. 152870-152889
Publication date: 2025-09-01
Publication type: Journal Article
DOI: 10.1109/ACCESS.2025.3604427
URL: https://ieeexplore.ieee.org/document/11145759/
Impact factor: 3.6
JCR: Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Citations: 0
Abstract
Natural language processing (NLP) has gained prominence through its wide range of applications, with deep neural networks driving major advances. Difficulties remain, however, especially in Arabic NLP, where the language's large vocabulary of over 12 million words and its several dialects pose particular problems. Although Arabic has a large speaker base, NLP research on the language faces challenges, particularly class imbalance. Standard class-balancing methods often overlook intra-class similarity, a crucial factor in model training. To bridge this gap, we present a new approach that computes intra-class similarity using cosine similarity and embedding models in order to find ideal class weights for model training. We evaluated the proposed approach on two benchmark datasets: the Arabic Semantic Question Similarity dataset (NSURL) and the Microsoft Research Paraphrase Corpus (MRPC). With state-of-the-art accuracy of 83.25% on the MRPC dataset and 96.931% on the NSURL dataset, the proposed approach proved successful in improving model performance in Arabic text classification.
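The abstract does not give the formula that maps intra-class similarity to class weights, so the following is only a minimal sketch of the general idea, not the authors' method. It computes the mean pairwise cosine similarity within each class and then, as an illustrative assumption, shrinks each class's sample count to a hypothetical "effective count" `1 + (n - 1) * (1 - s)` before applying the usual balanced-weight rule. The function names, the effective-count rule, and the toy embeddings are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_class_similarity(embeddings, labels):
    """Mean pairwise cosine similarity among the samples of each class."""
    by_class = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        by_class[label].append(vec)
    sims = {}
    for label, vecs in by_class.items():
        pairs = list(combinations(vecs, 2))
        # A single-sample class has no pairs; treat it as fully self-similar.
        sims[label] = (
            sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 1.0
        )
    return sims


def similarity_aware_class_weights(embeddings, labels):
    """Hypothetical weighting rule (an assumption, not taken from the paper).

    Highly self-similar classes are treated as containing fewer "effective"
    samples, and weights follow total / (num_classes * effective_count).
    """
    sims = intra_class_similarity(embeddings, labels)
    counts = Counter(labels)
    effective = {}
    for label, sim in sims.items():
        s = min(max(sim, 0.0), 1.0)  # clamp negative similarities to 0
        effective[label] = 1.0 + (counts[label] - 1) * (1.0 - s)
    total = sum(effective.values())
    return {
        label: total / (len(effective) * eff) for label, eff in effective.items()
    }
```

In practice the embeddings would come from a sentence-embedding model, and the resulting dictionary could be supplied as per-class weights to a weighted loss function. Whether redundancy within a class should raise or lower its weight is a design choice that the abstract leaves open.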
IEEE Access: COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Articles published: 6673
Review time: 6 weeks
Journal description:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.