{"title":"Fine-Tashkeel: Fine-Tuning Byte-Level Models for Accurate Arabic Text Diacritization","authors":"Bashar Al-Rfooh, Gheith A. Abandah, Rami Al-Rfou","doi":"10.1109/JEEIT58638.2023.10185725","DOIUrl":null,"url":null,"abstract":"Most of previous work on learning diacritization of the Arabic language relied on training models from scratch. In this paper, we investigate how to leverage pre-trained language models to learn diacritization. We fine-tune token-free pre-trained multilingual models (ByT5) to learn to predict and insert missing diacritics in Arabic text, a complex task that requires understanding the sentence semantics and the morphological structure of the tokens. We achieve state-of-the-art accuracy on the dia-critization task with minimal amount of training and no feature engineering, reducing WER (word error rate) by 40%. We release our fine-tuned models for the greater benefit of the researchers in the community.","PeriodicalId":177556,"journal":{"name":"2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JEEIT58638.2023.10185725","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Most previous work on learning diacritization of Arabic has relied on training models from scratch. In this paper, we investigate how to leverage pre-trained language models to learn diacritization. We fine-tune token-free, pre-trained multilingual models (ByT5) to predict and insert missing diacritics in Arabic text, a complex task that requires understanding both sentence semantics and the morphological structure of the tokens. We achieve state-of-the-art accuracy on the diacritization task with a minimal amount of training and no feature engineering, reducing the word error rate (WER) by 40%. We release our fine-tuned models for the benefit of researchers in the community.
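To illustrate the approach described in the abstract, below is a minimal sketch (not the authors' released code) of fine-tuning a byte-level ByT5 model for Arabic diacritization with Hugging Face Transformers. The checkpoint size, learning rate, and the single training pair are assumptions for illustration only; the abstract does not specify them.

```python
# Minimal sketch: fine-tune a byte-level ByT5 model to restore Arabic diacritics.
# Checkpoint, hyperparameters, and the example pair are assumptions, not the paper's setup.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/byt5-small"  # assumed checkpoint size
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# One hypothetical training pair: undiacritized source -> fully diacritized target.
source = "ذهب الولد الى المدرسة"
target = "ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed learning rate

# One gradient step on the single pair; a real run would loop over a diacritized corpus.
model.train()
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# Inference: generate the diacritized form of new undiacritized text.
model.eval()
with torch.no_grad():
    pred_ids = model.generate(**tokenizer(source, return_tensors="pt"),
                              max_new_tokens=256)
print(tokenizer.decode(pred_ids[0], skip_special_tokens=True))
```

Because ByT5 operates directly on UTF-8 bytes, no Arabic-specific tokenizer or feature engineering is needed: the model reads the raw undiacritized bytes and emits the same text with diacritic marks inserted.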