Transformer-based model for Moroccan Arabizi-to-Arabic transliteration using a semi-automatic annotated dataset

Soufiane Hajbi, Omayma Amezian, Mouhssine Ziyad, Issame El Kaime, Redouan Korchyine, Younes Chihab

International Journal of Information Management Data Insights, Vol. 5, No. 2, Article 100351. Published 2025-06-19. DOI: 10.1016/j.jjimei.2025.100351. Available at: https://www.sciencedirect.com/science/article/pii/S2667096825000333
Citations: 0
Abstract
Language models have recently achieved state-of-the-art results in tasks such as translation, sentiment analysis, and text classification for high-resource languages. However, dedicated models for low-resource languages remain scarce, largely due to a lack of annotated data and linguistic resources. Most efforts focus on fine-tuning models trained on high-resource languages using limited data, resulting in a substantial performance gap. Moroccan Darija (MD), widely spoken in Morocco, lacks language resources and dedicated models. Additionally, MD texts often employ the Arabizi writing form, which combines Latin characters and numerals with Arabic script, further complicating Natural Language Processing (NLP) tasks. This work presents the first transformer-based model designed specifically for transliterating Moroccan Arabizi to Arabic. The approach leverages a character-level modeling architecture and a semi-automatically generated dataset containing over 33k word pairs, capturing significant linguistic diversity. The model achieves a state-of-the-art word transliteration accuracy (WTA) of 93% and a character error rate (CER) of 4.73% on unseen Moroccan Arabizi data, highlighting the potential of transformer models to improve transliteration accuracy for low-resource languages, particularly MD.
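The abstract reports two evaluation metrics, word transliteration accuracy (WTA) and character error rate (CER). The Python sketch below shows one common way such metrics are computed for word-level transliteration; the paper's exact metric definitions are not given in the abstract, and the sample Arabizi-to-Arabic pairs and function names are illustrative assumptions, not material from the authors' dataset or code.

# Minimal sketch under assumed metric definitions (illustrative data only):
#   CER = total character-level edit distance / total reference characters
#   WTA = fraction of words whose prediction exactly matches the reference

def edit_distance(a: str, b: str) -> int:
    """Levenshtein (character-level) edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(predictions, references):
    """Character error rate: summed edits divided by summed reference lengths."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    return edits / chars

def wta(predictions, references):
    """Word transliteration accuracy: exact-match rate over word pairs."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Hypothetical model outputs; these pairs are NOT from the paper's dataset.
references  = ["سلام", "بزاف", "مزيان"]   # gold Arabic transliterations
predictions = ["سلام", "بزاف", "مزيا"]    # model outputs (last one drops a character)
print(f"WTA = {wta(predictions, references):.2%}, CER = {cer(predictions, references):.2%}")

On these three illustrative pairs the script prints WTA = 66.67% and CER = 7.69%, which makes the reported 93% WTA and 4.73% CER easy to interpret: roughly 93 of every 100 held-out words are transliterated exactly, and fewer than 5 characters per 100 reference characters need correction.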