Soufiane Hajbi, Omayma Amezian, Mouhssine Ziyad, Issame El Kaime, Redouan Korchyine, Younes Chihab
International Journal of Information Management Data Insights, Volume 5, Issue 2, Article 100351. Published 2025-06-19. DOI: 10.1016/j.jjimei.2025.100351. URL: https://www.sciencedirect.com/science/article/pii/S2667096825000333
Transformer-based model for Moroccan Arabizi-to-Arabic transliteration using a semi-automatic annotated dataset
Language models have recently achieved state-of-the-art results in tasks such as translation, sentiment analysis, and text classification for high-resource languages. However, dedicated models for low-resource languages remain scarce, largely because annotated data and linguistic resources are lacking. Most efforts fine-tune models trained on high-resource languages using limited data, which leaves a substantial performance gap. Moroccan Darija (MD), widely spoken in Morocco, lacks both language resources and dedicated models. In addition, MD texts often use the Arabizi writing form, which combines Latin characters and numbers with Arabic script, further complicating Natural Language Processing (NLP) tasks. This work presents the first transformer-based model designed specifically for transliterating Moroccan Arabizi to Arabic. The approach relies on a character-level modeling architecture and a semi-automatically generated dataset of over 33k word pairs that captures significant linguistic diversity. The model achieves a state-of-the-art word transliteration accuracy (WTA) of 93% and a character error rate (CER) of 4.73% on unseen Moroccan Arabizi data, highlighting the potential of transformer models to improve transliteration accuracy for low-resource languages, particularly MD.
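The abstract reports two metrics, WTA and CER, without giving their formulas. The following is a minimal sketch of how such metrics are commonly computed, assuming WTA is the fraction of words transliterated exactly and CER is the total character-level Levenshtein distance divided by the total number of reference characters. These definitions, the function names, and the three toy word pairs are assumptions made for illustration; they are not taken from the paper or its dataset.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _check(refs: Sequence[str], hyps: Sequence[str]) -> None:
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and the same length")


def word_transliteration_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Fraction of predicted words that match the reference exactly (assumed definition)."""
    _check(refs, hyps)
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total edit distance divided by total reference length (assumed definition)."""
    _check(refs, hyps)
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


# Hypothetical toy data: reference Arabic words and model outputs.
REFS = ["سلام", "كتاب", "بحر"]
HYPS = ["سلام", "كتب", "بحر"]  # second word is missing one letter

if __name__ == "__main__":
    print(f"WTA = {word_transliteration_accuracy(REFS, HYPS):.3f}")
    print(f"CER = {character_error_rate(REFS, HYPS):.3f}")
```

On the toy data, two of three words match exactly and one character out of eleven is wrong, so a single-letter error costs a full word under WTA but only a small fraction under CER. Note that this sketch counts Unicode code points, so Arabic diacritics, if present, would count as separate characters.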