Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Journal for Language Technology and Computational Linguistics. Pub Date: 2023-06-21. DOI: 10.21248/jlcl.36.2023.243

Barack Wanjawa, Lilian Wanzare, Florence Indede, Owen McOnyango, Edward Ombui, Lawrence Muchemi



Abstract

Indigenous African languages are categorized as under-served in Natural Language Processing, and they therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data of sufficient quality for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya (the latter in three dialects: Lumarachi, Lulogooli and Lubukusu). Data collection was done by researchers deployed to various sources such as communities, schools, media, and publishers. The dataset comprises 5,594 items: 4,442 texts totalling 5.6 million words and 1,152 speech files totalling 177 hours. Based on this data, further datasets were developed, including Part of Speech tagging sets of 50,000 tagged words for Dholuo and 93,000 tagged words for the Luhya dialects. We also developed 7,537 Question-Answer pairs from 1,445 Swahili texts and created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. Additionally, we developed two proof-of-concept systems: a Kiswahili speech-to-text system and a machine learning system for the Question Answering task. These achieved a word error rate of 18.87% and an Exact Match (EM) score of 80%, respectively. These initial results show promise for the usability of Kencorpus to the machine learning community. Kencorpus is one of few public domain corpora for these three low-resource languages and forms a basis for learning and sharing experiences in similar work, especially for low-resource languages. Challenges in developing the corpus included deficiencies in the data sources, data cleaning difficulties, relatively short project timelines and the Coronavirus disease (COVID-19) pandemic, which restricted movement and hence the ability to collect data in a timely manner.
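
The abstract reports a word error rate and an Exact Match score but does not say how either was computed. The following is a minimal sketch of the conventional definitions, not the authors' evaluation code. The function names, the whitespace tokenisation, the lowercase normalisation and the toy strings are all assumptions made for illustration; none of them come from Kencorpus.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace, with no further normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference answer.

    Assumption: answers are compared after lowercasing and collapsing
    whitespace; other evaluations may normalise differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one reference answer is required")

    def normalise(text: str) -> str:
        return " ".join(text.lower().split())

    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Invented toy strings: one substitution and one deletion over 5 reference words.
    wer = word_error_rate("habari ya asubuhi rafiki yangu", "habari za asubuhi rafiki")
    print(f"WER: {wer:.2f}")  # WER: 0.40

    # Invented toy answers: two of three match after normalisation.
    em = exact_match(
        ["Nairobi", "Ziwa Victoria", "mwaka 1963"],
        ["nairobi", "ziwa  Victoria", "mwaka 1964"],
    )
    print(f"EM: {em:.2f}")  # EM: 0.67
```

Under these definitions a lower word error rate is better and a higher Exact Match is better, so the two reported figures (18.87% and 80%) point in opposite directions on their respective scales.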
