Word-Based Bantu Language Identification using Naïve Bayes
Boago Okgetheng, Emmanuella Budu
2022 IST-Africa Conference (IST-Africa), published 2022-05-16
DOI: https://doi.org/10.23919/IST-Africa56635.2022.9845618
Citations: 1
Abstract
Language identification of text has become increasingly important as large quantities of text are processed or filtered automatically. It is a preprocessing step in Natural Language Processing (NLP) tasks such as information retrieval and machine translation. Few studies have addressed automatic language identification for Bantu languages. The task is challenging for these languages because data are scarce and because some languages, such as Setswana and Sesotho, are written very similarly. In this paper, we present a word-based Naïve Bayes classifier that identifies whether a word is Sesotho or Setswana. The classifier was trained in a supervised manner on words from both languages. The training data also include adjectives, pronouns, adverbs and enumeratives. The classifier reaches an accuracy of 71.4%, which shows that the two languages can be distinguished. When we doubled the number of words, accuracy rose to 78%. We also report that the classifier fails on homographs. Performance could be improved further with more data. Additionally, syllable-level and sentence-level identification could be implemented alongside the word-based classifier.
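To make the approach concrete, the following is a minimal sketch of a word-level Naïve Bayes classifier, not the authors' implementation. The abstract does not say how words are represented, so the character-bigram features, the Laplace smoothing, the class name `WordNaiveBayes` and the tiny word lists are all assumptions chosen for illustration. The word "metsi" is placed in both lists to show why homographs cannot be separated at the word level.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Split a word into character n-grams, with ^ and $ marking its edges."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class WordNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (an assumed feature set)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n              # n-gram length
        self.alpha = alpha      # Laplace smoothing constant
        self.counts = {}        # label -> Counter of n-gram counts
        self.totals = {}        # label -> total n-gram count
        self.words_per_label = Counter()
        self.vocab = set()

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            grams = char_ngrams(word, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.words_per_label[label] += 1
            self.vocab.update(grams)
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def log_scores(self, word):
        """Return the unnormalised log posterior of each label for one word."""
        n_words = sum(self.words_per_label.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, grams in self.counts.items():
            score = math.log(self.words_per_label[label] / n_words)  # prior
            denom = self.totals[label] + self.alpha * vocab_size
            for gram in char_ngrams(word, self.n):
                score += math.log((grams[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


# Toy word lists for illustration only; these are NOT the authors' data.
setswana = ["dumela", "ngwana", "tsamaya", "leboga", "metsi"]
sesotho = ["lumela", "ngoana", "tsamaea", "leboha", "metsi"]

words = setswana + sesotho
labels = ["setswana"] * len(setswana) + ["sesotho"] * len(sesotho)
model = WordNaiveBayes().fit(words, labels)

# Spelling differences such as "wa" vs "oa" separate the two classes.
pred_ngwana = model.predict("ngwana")
pred_ngoana = model.predict("ngoana")

# A homograph has identical features in both classes, so its scores tie.
homograph_scores = model.log_scores("metsi")
```

In this sketch the two classes have the same priors and the same total n-gram counts, so a word spelled identically in both languages receives the same score under each label. That is one way to read the reported failure on homographs: a word-level model has no context to break the tie, which is why sentence-level identification is suggested as a complement.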