Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning

Q2 Computer Science

Annals of Emerging Technologies in Computing. Pub Date: 2021-07-01. DOI: 10.33166/AETIC.2021.03.005

D. Mati, Mentor Hamiti, Arsim Susuri, B. Selimi, Jaumin Ajdari


Citations: 2

Abstract



The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research on unsupervised learning, focusing on the challenges of building a dictionary for the Albanian language and creating part-of-speech tagging models. Most languages have their own dictionary, but low-resource languages lack such resources. These resources facilitate the sharing of information and services for users and whole communities through natural language processing. The experimental corpus for Albanian includes 250K sentences from different disciplines, and the paper proposes a part-of-speech tag set that can adequately represent the underlying linguistic phenomena. The purpose of this paper is to contribute to the development of Albanian. Experiments with the Albanian corpus revealed that its use of articles and pronouns resembles that of higher-resource languages. According to this study, using total expected frequency to tag words correctly proved effective for populating the Albanian language dictionary.


Source journal

Annals of Emerging Technologies in Computing (Computer Science: Computer Science (all))

CiteScore

3.50

Self-citation rate

0.00%

Number of publications