Amharic Text Corpus based on Parts of Speech tagging and headwords

2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA) Pub Date : 2021-11-22 DOI:10.1109/ict4da53266.2021.9672246

T. Abebe, E. Alemneh


Abstract

A corpus is a key resource for studying natural languages and for developing tools that process human languages. Because few studies have addressed the development of Amharic corpora, the existing corpora are very small and not readily accessible to academics or to commercial and non-commercial organizations. This paper presents an Amharic text corpus developed by annotating each word with its part-of-speech tag and reducing each orthographic word to its headword through either a derivational or an inflectional process. We extracted 12,720 sentences from text documents collected in the domain of proclamations. These documents include the Ethiopian constitution of 1987 E.C. and several policies of the Amhara regional state and of the federal government of Ethiopia. The extracted sentences contain 331,728 tokens. Sixty-six tag sets were compiled from base part-of-speech tag classes and compound part-of-speech tag classes, based on different factors and on the representation of orthographic words. To support the manual annotation of each orthographic word, we developed a semi-automatic Amharic text tagger. The outputs of the research project are pre-processed Amharic text stored in plain text format and a tagged Amharic text corpus encoded in Extensible Markup Language (XML) format. The tag sets of the annotated corpus are represented in both Ge'ez script and English characters. We plan to increase the number of tag sets and the size of the corpus in the near future. Moreover, we are working towards fully automating the semi-automatic Amharic text tagger.
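The abstract describes a corpus in which each orthographic word carries a part-of-speech tag (in both English characters and Ge'ez script) and a headword, encoded in XML. The sketch below shows one way such a record could be written and read back with Python's standard library. It is only an illustration under stated assumptions: the element and attribute names (`corpus`, `sentence`, `w`, `headword`, `pos`, `pos_gez`), the sample word, and the tag labels are hypothetical and are not taken from the paper's actual schema or tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample token: (orthographic word, headword, English tag, Ge'ez-script tag).
# The plural noun "ቤቶች" (houses) is reduced to the headword "ቤት" (house).
# The tag labels are placeholders, not the paper's tag set.
SAMPLE_TOKENS = [
    ("ቤቶች", "ቤት", "N", "ስም"),
]


def build_sentence(sent_id, tokens):
    """Build a <sentence> element with one <w> child per annotated token."""
    sentence = ET.Element("sentence", {"id": str(sent_id)})
    for word, headword, tag_en, tag_gez in tokens:
        w = ET.SubElement(
            sentence,
            "w",
            {"headword": headword, "pos": tag_en, "pos_gez": tag_gez},
        )
        w.text = word
    return sentence


def read_tokens(root):
    """Return (word, headword, English tag, Ge'ez tag) tuples for every <w> under root."""
    return [
        (w.text, w.get("headword"), w.get("pos"), w.get("pos_gez"))
        for w in root.iter("w")
    ]


# Assemble a one-sentence corpus, serialize it, and parse it back.
corpus = ET.Element("corpus", {"lang": "am"})
corpus.append(build_sentence(1, SAMPLE_TOKENS))
xml_text = ET.tostring(corpus, encoding="unicode")
round_trip = read_tokens(ET.fromstring(xml_text))
```

Storing the headword and both tag spellings as attributes keeps the original orthographic word as the element text, so the plain sentence can still be recovered by concatenating the text of the `w` elements. The paper may use a different layout.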
