Transformer-based model for Moroccan Arabizi-to-Arabic transliteration using a semi-automatic annotated dataset

Soufiane Hajbi, Omayma Amezian, Mouhssine Ziyad, Issame El Kaime, Redouan Korchyine, Younes Chihab

International Journal of Information Management Data Insights, Vol. 5, No. 2, Article 100351. Published 2025-06-19. DOI: 10.1016/j.jjimei.2025.100351. Available at: https://www.sciencedirect.com/science/article/pii/S2667096825000333
Citations: 0
Abstract
Language models have recently achieved state-of-the-art results in tasks such as translation, sentiment analysis, and text classification for high-resource languages. However, dedicated models for low-resource languages remain scarce, largely due to a lack of annotated data and linguistic resources. Most efforts focus on fine-tuning models trained on high-resource languages using limited data, resulting in a substantial performance gap. Moroccan Darija (MD), widely spoken in Morocco, lacks language resources and dedicated models. Additionally, MD texts often employ the Arabizi writing form, which combines Latin characters and numerals with Arabic script, further complicating Natural Language Processing (NLP) tasks. This work presents the first transformer-based model designed specifically for transliterating Moroccan Arabizi to Arabic. The approach leverages a character-level modeling architecture and a semi-automatically generated dataset containing over 33k word pairs, capturing significant linguistic diversity. The model achieves a state-of-the-art word transliteration accuracy (WTA) of 93% and a character error rate (CER) of 4.73% on unseen Moroccan Arabizi data, highlighting the potential of transformer models to improve transliteration accuracy for low-resource languages, particularly MD.
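The abstract reports two evaluation metrics, word transliteration accuracy (WTA) and character error rate (CER). The Python sketch below shows one common way such metrics are computed for word-level transliteration; the paper's exact metric definitions are not given in the abstract, and the sample Arabizi-to-Arabic pairs and function names are illustrative assumptions, not material from the authors' dataset or code.

# Minimal sketch under assumed metric definitions (illustrative data only):
#   CER = total character-level edit distance / total reference characters
#   WTA = fraction of words whose prediction exactly matches the reference

def edit_distance(a: str, b: str) -> int:
    """Levenshtein (character-level) edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(predictions, references):
    """Character error rate: summed edits divided by summed reference lengths."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    return edits / chars

def wta(predictions, references):
    """Word transliteration accuracy: exact-match rate over word pairs."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Hypothetical model outputs; these pairs are NOT from the paper's dataset.
references  = ["سلام", "بزاف", "مزيان"]   # gold Arabic transliterations
predictions = ["سلام", "بزاف", "مزيا"]    # model outputs (last one drops a character)
print(f"WTA = {wta(predictions, references):.2%}, CER = {cer(predictions, references):.2%}")

On these three illustrative pairs the script prints WTA = 66.67% and CER = 7.69%, which makes the reported 93% WTA and 4.73% CER easy to interpret: roughly 93 of every 100 held-out words are transliterated exactly, and fewer than 5 characters per 100 reference characters need correction.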