{"title":"The effects of a corpus on isiZulu spellcheckers based on N-grams","authors":"Balone Ndaba, H. Suleman, C. Keet, Langa Khumalo","doi":"10.1109/ISTAFRICA.2016.7530643","DOIUrl":null,"url":null,"abstract":"Correct spelling contributes to good content accessibility and readability for textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the major language in South Africa. The objective of this research is to investigate development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora, being Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained with old text (bible in isiZulu), it did not perform well when tested with the two corpora that contain more recent texts, such as the constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. This is less resource-intensive compared to manual specification of rules, and therefore the potential impact on realising spellcheckers for Bantu languages is now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.","PeriodicalId":326074,"journal":{"name":"2016 IST-Africa Week Conference","volume":"122 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IST-Africa Week Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISTAFRICA.2016.7530643","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29
Abstract
Correct spelling contributes to good content accessibility and readability for textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the major language in South Africa. The objective of this research is to investigate development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora, being Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained with old text (bible in isiZulu), it did not perform well when tested with the two corpora that contain more recent texts, such as the constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. This is less resource-intensive compared to manual specification of rules, and therefore the potential impact on realising spellcheckers for Bantu languages is now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.