Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning

Q2 Computer Science
D. Mati, Mentor Hamiti, Arsim Susuri, B. Selimi, Jaumin Ajdari
DOI: 10.33166/AETIC.2021.03.005 (https://doi.org/10.33166/AETIC.2021.03.005)
Published: 2021-07-01
Publication type: Journal Article
Citations: 2

Abstract

The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research on unsupervised learning, focusing on the challenges of building a dictionary for the Albanian language and creating part-of-speech tagging models. Most languages have their own dictionary, but low-resource languages lack such resources. Through natural language processing, a dictionary facilitates the sharing of information and services for users and whole communities. The experimental corpus for Albanian comprises 250K sentences from different disciplines, and the paper proposes a part-of-speech tag set that can adequately represent the underlying linguistic phenomena. The purpose of this paper is to contribute to the development of resources for Albanian. Experiments on the Albanian corpus revealed that its use of articles and pronouns resembles that of higher-resource languages. According to this study, total expected frequency proved effective as a means of correctly tagging words when populating the Albanian language dictionary.
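
The abstract does not define "total expected frequency", so the following is only an illustrative sketch under one possible reading: each token occurrence carries a probability distribution over candidate tags (for example, from an unsupervised tagger), the probabilities are summed per word and tag across the corpus, and the tag with the largest total is entered in the dictionary. The function name `build_dictionary`, the `min_share` threshold, the tag labels, and the sample probabilities are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_dictionary(observations, min_share=0.5):
    """Pick one tag per word by summing per-occurrence tag probabilities.

    observations: iterable of (word, {tag: probability}) pairs, one pair
    per token occurrence in the corpus (an assumed input format).
    min_share: hypothetical cutoff; a word is only added when its best
    tag holds at least this share of the word's total expected count.
    """
    expected = defaultdict(lambda: defaultdict(float))
    for word, tag_probs in observations:
        for tag, prob in tag_probs.items():
            # Accumulate the expected count of this (word, tag) pair.
            expected[word.lower()][tag] += prob

    dictionary = {}
    for word, totals in expected.items():
        best_tag, best_total = max(totals.items(), key=lambda kv: kv[1])
        if best_total / sum(totals.values()) >= min_share:
            dictionary[word] = best_tag
    return dictionary


# Invented sample data: Albanian word forms with made-up tag probabilities.
SAMPLE = [
    ("libri", {"NOUN": 0.9, "ADJ": 0.1}),
    ("libri", {"NOUN": 0.7, "ADJ": 0.3}),
    ("dhe", {"CONJ": 1.0}),
    ("i", {"ART": 0.5, "PRON": 0.5}),
    ("i", {"ART": 0.4, "PRON": 0.6}),
]

if __name__ == "__main__":
    print(build_dictionary(SAMPLE))
    print(build_dictionary(SAMPLE, min_share=0.6))
```

Raising `min_share` keeps ambiguous forms such as the sample "i" out of the dictionary until more evidence accumulates. Whether the paper applies any such threshold is not stated in the abstract.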
Source journal: Annals of Emerging Technologies in Computing (Computer Science, all)
CiteScore: 3.50
Self-citation rate: 0.00%
Articles published: 26