Improvement of the COTA-Orthography system through language modeling
Fatma Zahra Besdouri, A. Mekki, Inès Zribi, M. Ellouze
2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA), 2021-11-01
DOI: 10.1109/AICCSA53542.2021.9686898 (https://doi.org/10.1109/AICCSA53542.2021.9686898)
Abstract
The lack of a single standard orthography leads to multiple written forms of the same word. This orthographic inconsistency is a frequent problem for Natural Language Processing (NLP). In this paper, we present a contextual method, based on the CODA-TUN orthography convention [34], to improve the semi-automatic normalization tool COTA Orthography [7], [25]. Our method targets words that have multiple possible corrections, which this system handles only partially. To this end, we trained and improved a trigram language model on a large corpus. We also introduce a generative algorithm that retrieves candidates for sentences containing the target words. The correct correction is then selected using the trigram model. The evaluation results show that the selection accuracy reaches 79.38%.
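The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the add-one smoothing, the sentence padding symbols, and the toy Latin-script tokens are all assumptions made for illustration, since the abstract does not specify them. The sketch expands a sentence into candidate sentences, one per combination of possible corrections, and keeps the candidate that a trigram model scores highest.

```python
import math
from collections import Counter
from itertools import product


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, len(vocab)


def log_prob(tokens, model):
    """Sentence log-probability with add-one smoothing (an assumed choice)."""
    tri, ctx, vocab_size = model
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        count = tri[context + (padded[i],)]
        total += math.log((count + 1) / (ctx[context] + vocab_size))
    return total


def generate_candidates(tokens, corrections):
    """Expand each ambiguous word into all of its possible corrections."""
    options = [corrections.get(tok, [tok]) for tok in tokens]
    return [list(combo) for combo in product(*options)]


def select_correction(tokens, corrections, model):
    """Return the candidate sentence with the highest trigram score."""
    candidates = generate_candidates(tokens, corrections)
    return max(candidates, key=lambda cand: log_prob(cand, model))


# Toy placeholder data, not taken from the paper's corpus.
corpus = [
    ["i", "saw", "the", "cat"],
    ["the", "cat", "sat"],
    ["i", "saw", "the", "dog"],
]
model = train_trigram(corpus)
corrections = {"kat": ["cat", "cut"]}  # hypothetical ambiguous word
best = select_correction(["i", "saw", "the", "kat"], corrections, model)
```

Note that the number of candidates grows multiplicatively with the number of ambiguous words in a sentence, so a practical system would likely prune or score candidates incrementally; how the paper handles this is not stated in the abstract.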