{"title":"Empowering Arabic diacritic restoration models with robustness, generalization, and minimal diacritization","authors":"Ruba Kharsa , Ashraf Elnagar , Sane Yagi","doi":"10.1016/j.ipm.2025.104345","DOIUrl":null,"url":null,"abstract":"<div><div>Arabic diacritization is essential for ensuring accurate pronunciation, clarity, and disambiguation of texts. It is a vital task in Arabic natural language processing. Despite substantial progress in the field, existing models struggle to generalize across the diverse forms of Arabic and perform poorly in noisy, error-prone environments. These limitations may be tied to problems in training data and, more critically, to insufficient contextual understanding. To address these gaps, we present SukounBERT.v2, a BERT-based Arabic diacritization system that is built using a multi-phase approach. We refine the Arabic Diacritization (AD) dataset by correcting spelling mistakes, introducing a line-splitting mechanism, and by injecting various forms of noise into the dataset, such as spelling errors, transliterated non-Arabic words, and nonsense tokens. Furthermore, we develop a context-aware training dataset that incorporates explicit diacritic markings and the diacritic naming of classical grammar treatises. Our work also introduces the Sukoun Corpus, a large-scale, diverse dataset comprising over 5.2 million lines and 71 million tokens that were sourced from Classical Arabic texts, Modern Standard Arabic writings, dictionaries, poetry, and purpose-built contextual sentences. Complementing this is a token-level mapping dictionary that enables minimal diacritization without sacrificing accuracy. This is a previously unreported feature in Arabic diacritization research. Trained on this enriched dataset, SukounBERT.v2 delivers state-of-the-art performance with over 55% relative reduction in Diacritic Error Rate (DER) and Word Error Rate (WER) compared to leading models. These results underscore the impact of context-aware and noise-resilient modeling in advancing the field of Arabic text processing.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"63 1","pages":"Article 104345"},"PeriodicalIF":6.9000,"publicationDate":"2025-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325002869","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Arabic diacritization is essential for ensuring accurate pronunciation, clarity, and disambiguation of texts, making it a vital task in Arabic natural language processing. Despite substantial progress in the field, existing models struggle to generalize across the diverse forms of Arabic and perform poorly in noisy, error-prone environments. These limitations may be tied to problems in training data and, more critically, to insufficient contextual understanding. To address these gaps, we present SukounBERT.v2, a BERT-based Arabic diacritization system built using a multi-phase approach. We refine the Arabic Diacritization (AD) dataset by correcting spelling mistakes, introducing a line-splitting mechanism, and injecting various forms of noise, such as spelling errors, transliterated non-Arabic words, and nonsense tokens. Furthermore, we develop a context-aware training dataset that incorporates explicit diacritic markings and the diacritic naming conventions of classical grammar treatises. Our work also introduces the Sukoun Corpus, a large-scale, diverse dataset comprising over 5.2 million lines and 71 million tokens sourced from Classical Arabic texts, Modern Standard Arabic writings, dictionaries, poetry, and purpose-built contextual sentences. Complementing this is a token-level mapping dictionary that enables minimal diacritization without sacrificing accuracy, a feature previously unreported in Arabic diacritization research. Trained on this enriched dataset, SukounBERT.v2 delivers state-of-the-art performance, with a relative reduction of over 55% in Diacritic Error Rate (DER) and Word Error Rate (WER) compared to leading models. These results underscore the impact of context-aware, noise-resilient modeling in advancing Arabic text processing.
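The abstract leaves the mechanics of the token-level mapping dictionary unstated. One plausible reading, sketched below in Python, is that a bare (undiacritized) token attested with exactly one diacritized form can be emitted without marks, while ambiguous or unseen tokens keep the model's full prediction. This is a hypothetical illustration of the idea, not the paper's implementation, and every name in it is made up.

```python
# A minimal sketch of "minimal diacritization" via a token-level mapping
# dictionary. Hypothetical: shows one way such a dictionary could work,
# not SukounBERT.v2's actual method.

AR_DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"  # tanween, short vowels, shadda, sukun

def strip_diacritics(token: str) -> str:
    """Remove Arabic diacritic marks, leaving the bare consonantal skeleton."""
    return "".join(ch for ch in token if ch not in AR_DIACRITICS)

def build_mapping(diacritized_corpus: list[list[str]]) -> dict[str, set[str]]:
    """Map each bare token to every diacritized form attested in the corpus."""
    mapping: dict[str, set[str]] = {}
    for line in diacritized_corpus:
        for tok in line:
            mapping.setdefault(strip_diacritics(tok), set()).add(tok)
    return mapping

def minimal_diacritize(predicted: list[str], mapping: dict[str, set[str]]) -> list[str]:
    """Drop diacritics where the bare token is unambiguous; keep them otherwise."""
    out = []
    for tok in predicted:
        forms = mapping.get(strip_diacritics(tok))
        if forms is not None and len(forms) == 1:
            out.append(strip_diacritics(tok))  # one attested form: marks are redundant
        else:
            out.append(tok)                    # ambiguous or unseen: keep full prediction
    return out

# Toy usage: the bare form of "كَتَبَ" is ambiguous (it also matches "كُتُبٌ"),
# so its diacritics are kept; "الوَلَدُ" has a single attested form, so it is
# emitted bare.
mapping = build_mapping([["كَتَبَ", "الوَلَدُ"], ["كُتُبٌ"]])
print(minimal_diacritize(["كَتَبَ", "الوَلَدُ"], mapping))  # ['كَتَبَ', 'الولد']
```

On the reported numbers: a relative reduction of over 55% means the new error rate is less than 45% of the baseline's; a baseline DER of 2.0%, for example, would fall below 0.9%.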
Journal Description:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of research on advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.