Amharic Text Corpus based on Parts of Speech tagging and headwords
T. Abebe, E. Alemneh
2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA), 2021-11-22
DOI: https://doi.org/10.1109/ict4da53266.2021.9672246
A corpus is a milestone resource for studying natural languages and for developing tools that process human language. Because few studies have addressed the development of Amharic corpora, the existing ones are very small and not readily accessible to academics or to commercial and non-commercial organizations. This paper presents an Amharic text corpus built by annotating each word with its part-of-speech tag and reducing each orthographic word to its headword through either a derivational or an inflectional process. We extracted 12,720 sentences from text documents collected in the domain of proclamations, among them the Ethiopian constitution of 1987 E.C. and several policies of the Amhara regional state and of the federal government of Ethiopia. These sentences contain 331,728 tokens. We compiled 66 tag sets from base part-of-speech tag set classes and compound part-of-speech tag set classes, based on different factors and on the representation of orthographic words. To support the manual annotation of each orthographic word, we developed a semi-automatic Amharic text tagger. The outputs of the project are pre-processed Amharic text stored in plain text format and a tagged Amharic text corpus encoded in Extensible Markup Language (XML). The tag sets of the annotated corpus are represented in both Ge'ez script and English characters. We plan to increase the number of tag sets and the size of the corpus in the near future, and we are working towards making the semi-automatic Amharic text tagger fully automatic.
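To make the annotation workflow concrete, the following is a minimal sketch of how a semi-automatic tagger might pre-fill part-of-speech tags and headwords from a lexicon, leave unknown words for a human annotator, and encode the result as XML. It is an illustration under stated assumptions, not the authors' implementation: the lexicon entries, the tag label `N`, the element names `sentence` and `w`, and the attributes `pos`, `headword`, and `status` are all hypothetical, since the abstract does not specify the actual tag set or XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon: orthographic word -> (POS tag, headword).
# The tag label "N" and these entries are illustrative assumptions only.
LEXICON = {
    "ቤት": ("N", "ቤት"),    # "house"
    "ቤቶች": ("N", "ቤት"),   # "houses", reduced to the headword "house"
    "ልጆች": ("N", "ልጅ"),   # "children", reduced to the headword "child"
}


def pre_tag(tokens, lexicon):
    """Return (token, tag, headword) triples.

    Tokens missing from the lexicon get (token, None, None) so that a
    human annotator can fill them in later (the "semi-automatic" part).
    """
    return [(tok, *lexicon.get(tok, (None, None))) for tok in tokens]


def to_xml(sentence_id, triples):
    """Encode one pre-tagged sentence as an XML element (assumed schema)."""
    sent = ET.Element("sentence", id=str(sentence_id))
    for tok, tag, head in triples:
        w = ET.SubElement(sent, "w")
        w.text = tok
        if tag is None:
            # No suggestion available: flag the word for manual annotation.
            w.set("status", "needs-review")
        else:
            w.set("pos", tag)
            w.set("headword", head)
    return sent


# The third token ("law") is deliberately absent from the lexicon.
tokens = "ልጆች ቤቶች ሕግ".split()
triples = pre_tag(tokens, LEXICON)
sentence = to_xml(1, triples)
xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```

Keeping the suggestion step (`pre_tag`) separate from the encoding step (`to_xml`) means the lexicon lookup could later be swapped for a trained tagger without touching the XML output, which is one plausible route towards the full automation the authors mention.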