{"title":"Empowering Arabic diacritic restoration models with robustness, generalization, and minimal diacritization","authors":"Ruba Kharsa , Ashraf Elnagar , Sane Yagi","doi":"10.1016/j.ipm.2025.104345","DOIUrl":null,"url":null,"abstract":"<div><div>Arabic diacritization is essential for ensuring accurate pronunciation, clarity, and disambiguation of texts. It is a vital task in Arabic natural language processing. Despite substantial progress in the field, existing models struggle to generalize across the diverse forms of Arabic and perform poorly in noisy, error-prone environments. These limitations may be tied to problems in training data and, more critically, to insufficient contextual understanding. To address these gaps, we present SukounBERT.v2, a BERT-based Arabic diacritization system that is built using a multi-phase approach. We refine the Arabic Diacritization (AD) dataset by correcting spelling mistakes, introducing a line-splitting mechanism, and by injecting various forms of noise into the dataset, such as spelling errors, transliterated non-Arabic words, and nonsense tokens. Furthermore, we develop a context-aware training dataset that incorporates explicit diacritic markings and the diacritic naming of classical grammar treatises. Our work also introduces the Sukoun Corpus, a large-scale, diverse dataset comprising over 5.2 million lines and 71 million tokens that were sourced from Classical Arabic texts, Modern Standard Arabic writings, dictionaries, poetry, and purpose-built contextual sentences. Complementing this is a token-level mapping dictionary that enables minimal diacritization without sacrificing accuracy. This is a previously unreported feature in Arabic diacritization research. Trained on this enriched dataset, SukounBERT.v2 delivers state-of-the-art performance with over 55% relative reduction in Diacritic Error Rate (DER) and Word Error Rate (WER) compared to leading models. These results underscore the impact of context-aware and noise-resilient modeling in advancing the field of Arabic text processing.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"63 1","pages":"Article 104345"},"PeriodicalIF":6.9000,"publicationDate":"2025-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325002869","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Arabic diacritization is essential for ensuring accurate pronunciation, clarity, and disambiguation of texts, making it a vital task in Arabic natural language processing. Despite substantial progress in the field, existing models struggle to generalize across the diverse forms of Arabic and perform poorly in noisy, error-prone environments. These limitations may be tied to problems in training data and, more critically, to insufficient contextual understanding. To address these gaps, we present SukounBERT.v2, a BERT-based Arabic diacritization system built using a multi-phase approach. We refine the Arabic Diacritization (AD) dataset by correcting spelling mistakes, introducing a line-splitting mechanism, and injecting various forms of noise, such as spelling errors, transliterated non-Arabic words, and nonsense tokens. Furthermore, we develop a context-aware training dataset that incorporates explicit diacritic markings and the diacritic naming conventions of classical grammar treatises. Our work also introduces the Sukoun Corpus, a large-scale, diverse dataset comprising over 5.2 million lines and 71 million tokens sourced from Classical Arabic texts, Modern Standard Arabic writings, dictionaries, poetry, and purpose-built contextual sentences. Complementing this is a token-level mapping dictionary that enables minimal diacritization without sacrificing accuracy, a feature previously unreported in Arabic diacritization research. Trained on this enriched dataset, SukounBERT.v2 delivers state-of-the-art performance, with a relative reduction of over 55% in Diacritic Error Rate (DER) and Word Error Rate (WER) compared to leading models. These results underscore the impact of context-aware, noise-resilient modeling in advancing Arabic text processing.
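The abstract leaves the mechanics of the token-level mapping dictionary unstated. One plausible reading, sketched below in Python, is that a bare (undiacritized) token attested with exactly one diacritized form can be emitted without marks, while ambiguous or unseen tokens keep the model's full prediction. This is a hypothetical illustration of the idea, not the paper's implementation, and every name in it is made up.

```python
# A minimal sketch of "minimal diacritization" via a token-level mapping
# dictionary. Hypothetical: shows one way such a dictionary could work,
# not SukounBERT.v2's actual method.

AR_DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"  # tanween, short vowels, shadda, sukun

def strip_diacritics(token: str) -> str:
    """Remove Arabic diacritic marks, leaving the bare consonantal skeleton."""
    return "".join(ch for ch in token if ch not in AR_DIACRITICS)

def build_mapping(diacritized_corpus: list[list[str]]) -> dict[str, set[str]]:
    """Map each bare token to every diacritized form attested in the corpus."""
    mapping: dict[str, set[str]] = {}
    for line in diacritized_corpus:
        for tok in line:
            mapping.setdefault(strip_diacritics(tok), set()).add(tok)
    return mapping

def minimal_diacritize(predicted: list[str], mapping: dict[str, set[str]]) -> list[str]:
    """Drop diacritics where the bare token is unambiguous; keep them otherwise."""
    out = []
    for tok in predicted:
        forms = mapping.get(strip_diacritics(tok))
        if forms is not None and len(forms) == 1:
            out.append(strip_diacritics(tok))  # one attested form: marks are redundant
        else:
            out.append(tok)                    # ambiguous or unseen: keep full prediction
    return out

# Toy usage: the bare form of "كَتَبَ" is ambiguous (it also matches "كُتُبٌ"),
# so its diacritics are kept; "الوَلَدُ" has a single attested form, so it is
# emitted bare.
mapping = build_mapping([["كَتَبَ", "الوَلَدُ"], ["كُتُبٌ"]])
print(minimal_diacritize(["كَتَبَ", "الوَلَدُ"], mapping))  # ['كَتَبَ', 'الولد']
```

On the reported numbers: a relative reduction of over 55% means the new error rate is less than 45% of the baseline's; a baseline DER of 2.0%, for example, would fall below 0.9%.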
Journal Description:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of research on advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.