A. Mekki, Inès Zribi, M. Ellouze, Lamia Hadrich Belguith
Jordanian Journal of Computers and Information Technology, 2022. DOI: https://doi.org/10.5455/jjcit.71-1655499240
COTA 2.0: an Automatic Corrector of Tunisian Arabic Social Media Texts
In written text, orthographic noise is a common concern for NLP, especially when processing social network comments and raw documents. For dialectal text, this is mainly due to orthographic conventions and morphological ambiguity. We propose to automatically normalize social media dialect corpora by following CODA-TUN, the Conventional Orthography for Tunisian Arabic (TA). The existing system developed for TA, COTA Orthography 1.0, cannot handle all forms of TA. Therefore, we propose to extend its rules and lexicons to address the peculiarities of social media dialect. For certain words, the COTA Orthography 1.0 system offers the user several possible corrections. Therefore, in the new version, we incorporated a trigram language model to automatically select the right correction. Our results show that the system can reduce transcription errors by 95.72%.
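The abstract does not describe how the trigram language model ranks candidate corrections, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes add-one smoothing, a toy training corpus of English placeholder tokens (not Tunisian Arabic data), and hypothetical function names (`train_trigram_counts`, `log_prob`, `pick_correction`). Each candidate is substituted into its context, and the candidate whose sentence receives the highest trigram log-probability is kept.

```python
from collections import Counter
from math import log


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, vocab


def log_prob(tokens, model):
    """Sentence log-probability under an add-one smoothed trigram model."""
    tri, bi, vocab = model
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        trigram = tuple(padded[i - 2:i + 1])
        context = tuple(padded[i - 2:i])
        # Add-one smoothing (an assumption) keeps unseen trigrams non-zero.
        total += log((tri[trigram] + 1) / (bi[context] + len(vocab)))
    return total


def pick_correction(left, candidates, right, model):
    """Return the candidate that makes the surrounding sentence most likely."""
    return max(
        candidates,
        key=lambda c: log_prob(list(left) + [c] + list(right), model),
    )


# Toy corpus of placeholder tokens; a real system would train on
# normalized (CODA-TUN) Tunisian Arabic text instead.
corpus = [
    ["the", "river", "bank", "flooded"],
    ["the", "river", "bank", "eroded"],
    ["the", "bench", "was", "wet"],
]
model = train_trigram_counts(corpus)

# A rule-based step is assumed to have proposed two corrections for one word.
best = pick_correction(["the", "river"], ["bank", "bench"], ["flooded"], model)
print(best)  # prints "bank"
```

Scoring the whole sentence rather than a single trigram lets every trigram that overlaps the ambiguous word contribute to the decision, including those that use the following context.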