The effects of a corpus on isiZulu spellcheckers based on N-grams

2016 IST-Africa Week Conference. Publication date: 2016-05-11. DOI: 10.1109/ISTAFRICA.2016.7530643

Balone Ndaba, H. Suleman, C. Keet, Langa Khumalo

Citations: 29

Abstract

Correct spelling contributes to the accessibility and readability of textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the major language in South Africa. The objective of this research is to investigate the development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora: Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained on older text (the Bible in isiZulu), it did not perform well when tested on the two corpora that contain more recent texts, such as the constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. Because this approach is less resource-intensive than specifying rules manually, spellcheckers for Bantu languages are now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.
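
The general idea of an n-gram-based spellchecker can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character-level n-grams with word-boundary padding and a simple "every n-gram must have been seen in training" decision rule, neither of which the abstract specifies. The function names, the `min_count` threshold, and the five-word toy corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpus_words, n=3):
    """Count every character n-gram observed in the training words."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_ngrams(word, n))
    return counts


def is_probably_correct(word, counts, n=3, min_count=1):
    """Accept a word only if each of its n-grams was seen at least min_count times."""
    grams = char_ngrams(word, n)
    return bool(grams) and all(counts[g] >= min_count for g in grams)


def accuracy(labelled_words, counts, n=3):
    """Fraction of (word, is_correct) pairs that the checker classifies correctly."""
    hits = sum(is_probably_correct(w, counts, n) == label
               for w, label in labelled_words)
    return hits / len(labelled_words)


# Toy training data (an assumption for illustration, not one of the paper's corpora).
TOY_CORPUS = ["sawubona", "ngiyabonga", "umuntu", "abantu", "isizulu"]
trigram_model = train(TOY_CORPUS, n=3)

print(is_probably_correct("sawubona", trigram_model))  # seen in training
print(is_probably_correct("sawubxna", trigram_model))  # contains unseen trigrams
```

In this sketch a word is flagged as soon as it contains one n-gram that is absent from the training data. That makes the checker's behaviour depend directly on which corpus it was trained on, which is the kind of sensitivity the abstract reports. Passing `n=4` to the same functions gives the quadrigram variant.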
