Authors: I. I. Ayogu, Onoja Abu
Published in: 2020 IEEE 2nd International Conference on Cyberspace (CYBER NIGERIA), 2021-02-23
DOI: https://doi.org/10.1109/CYBERNIGERIA51635.2021.9428872
Automatic Diacritic Recovery with Focus on the Quality of the Training Corpus for Resource-scarce Languages
The development and availability of high-quality corpora for many African languages are still hampered by a dearth of appropriate software tools and devices. To rapidly create large quantities of high-quality corpora for the majority of African and Nigerian languages, a diacritic tool is required. Presenting natural-language text without diacritic marks poses significant problems for both human readers and computational processing systems, because the accompanying grammatical, syntactic, or semantic information is partially or totally lost. This paper investigates the effect of the diacritic quality of a small training corpus on the classification accuracy of some simple, commonly used machine learning algorithms for diacritic restoration, following the character-based approach. The classification accuracy for eight of the diacritic-bearing characters of the Yoruba language of Nigeria was investigated. The results show that the completeness and correctness of the diacritics have a significant effect on the performance of the algorithms; the decision tree algorithm showed the best overall accuracy response, 3.22%, to the improvement in data quality. Observations of the algorithms' learning behaviour suggest that a 100,000-word corpus is adequate to train a decision tree model for automatic diacritic restoration of Yoruba, but insufficient to obtain state-of-the-art results with the LDA, LOGREG and SVM algorithms.
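To make the character-based framing concrete, the following is a minimal sketch of how diacritic restoration can be cast as per-character classification. It is not the authors' implementation: the function names, the context window of two characters, and the toy training string are all assumptions made for illustration, and a simple context-lookup table with a per-character backoff stands in for the decision tree, LDA, LOGREG and SVM classifiers studied in the paper.

```python
import unicodedata
from collections import Counter, defaultdict

def clusters(text):
    """Split text into (base, full) pairs: a base character and the same
    character together with any combining marks attached to it."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            base, full = out[-1]
            out[-1] = (base, full + ch)
        else:
            out.append((ch, ch))
    return [(b, unicodedata.normalize("NFC", f)) for b, f in out]

def strip_diacritics(text):
    """Return the text with all combining marks removed."""
    return "".join(base for base, _ in clusters(text))

def instances(text, window=2):
    """Yield (context, label) pairs. The context is the tuple of undiacritized
    characters around each position; the label is the diacritized form."""
    pairs = clusters(text)
    padded = ["_"] * window + [b for b, _ in pairs] + ["_"] * window
    for i, (_, full) in enumerate(pairs):
        yield tuple(padded[i:i + 2 * window + 1]), full

def train(text, window=2):
    """Count labels per full context and per base character (hypothetical
    stand-in for a trained classifier)."""
    by_context = defaultdict(Counter)
    by_base = defaultdict(Counter)
    for context, full in instances(text, window):
        by_context[context][full] += 1
        by_base[context[window]][full] += 1
    return by_context, by_base

def restore(plain, model, window=2):
    """Predict a diacritized form for every character of undiacritized text."""
    by_context, by_base = model
    out = []
    for context, base in instances(plain, window):
        if context in by_context:
            out.append(by_context[context].most_common(1)[0][0])
        elif base in by_base:
            out.append(by_base[base].most_common(1)[0][0])
        else:
            out.append(base)  # unseen character: leave unchanged
    return "".join(out)

# Toy usage: the training text is an illustrative assumption, not corpus data.
model = train("ọmọ ọmọ")
print(restore(strip_diacritics("ọmọ"), model))
```

Because every label is read directly from the diacritized training text, missing or incorrect marks in the corpus become wrong labels for the classifier, which is the dependence on corpus quality that the abstract reports.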