Usable Amharic text corpus for natural language processing applications

Applied Corpus Linguistics. Pub Date: 2022-12-01. DOI: 10.1016/j.acorp.2022.100033

Michael Melese Woldeyohannis, Million Meshesha

Volume 2, Issue 3, Article 100033. Available at: https://www.sciencedirect.com/science/article/pii/S266679912200017X


Abstract

In this paper, we describe the preparation of a usable Amharic text corpus for different Natural Language Processing (NLP) applications. Natural language applications such as document classification, topic modeling, machine translation, and speech recognition suffer greatly from a lack of digital resources. This is especially true for Amharic, a resource-constrained, morphologically rich, and complex language. In response, we collected a total of 67,739 Amharic news documents in 8 different categories from online sources. The collected corpus passed through a number of pre-processing steps, including data cleaning, text normalization, and punctuation correction. To validate the usability of the corpora collected from different domains, we conducted a baseline document classification experiment. Experimental results show that deep learning achieves 84.53% accuracy in the absence of linguistic information. The findings indicate that, despite the complexity of the Amharic language, the prepared corpora can be used for different natural language applications in the absence of linguistic resources such as a stemmer and a dictionary. We are working further on Amharic news document classification by incorporating language-independent stop-word detection, stemming, and unsupervised morphological segmentation of Amharic documents.
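The abstract names three pre-processing steps (data cleaning, text normalization, and punctuation correction) without giving their details. The Python sketch below shows one way such a pipeline could look. It is not the authors' implementation: the specific rules are assumptions, chosen because they are common in Amharic text processing, namely stripping HTML tags and URLs, folding homophone Ethiopic character series onto one representative series, and replacing a doubled Ethiopic wordspace with the Ethiopic full stop. All function names are hypothetical.

```python
import re
import unicodedata

# Assumption: homophone series are folded onto a single representative series.
# Each pair gives the first-order code point of the source and target series;
# the first seven orders of each series occupy consecutive code points.
_HOMOPHONE_BASES = {
    0x1210: 0x1200,  # HHA series -> HA series
    0x1280: 0x1200,  # XA series  -> HA series
    0x1220: 0x1230,  # SZA series -> SA series
    0x12D0: 0x12A0,  # PHARYNGEAL A series -> GLOTTAL A series
    0x1340: 0x1338,  # TZA series -> TSA series
}
HOMOPHONES = {
    src + order: dst + order
    for src, dst in _HOMOPHONE_BASES.items()
    for order in range(7)
}

WORDSPACE = "\u1361"  # Ethiopic wordspace
FULL_STOP = "\u1362"  # Ethiopic full stop


def clean(text: str) -> str:
    """Data cleaning (assumed rules): drop HTML tags and URLs, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize(text: str) -> str:
    """Text normalization (assumed rules): Unicode NFC plus homophone folding."""
    return unicodedata.normalize("NFC", text).translate(HOMOPHONES)


def fix_punctuation(text: str) -> str:
    """Punctuation correction (assumed rules): a doubled wordspace becomes a
    full stop, and whitespace before Ethiopic punctuation is removed."""
    text = text.replace(WORDSPACE * 2, FULL_STOP)
    return re.sub(r"\s+([\u1362-\u1366])", r"\1", text)


def preprocess(text: str) -> str:
    """Apply the three steps in the order the abstract lists them."""
    return fix_punctuation(normalize(clean(text)))
```

Calling `preprocess` on a raw scraped string returns a single cleaned line. Whether homophone folding is desirable depends on the downstream task, since it discards orthographic distinctions that some applications may want to keep.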
