Evaluation and Verification of NLP Datasets for the Albanian Language

Labehat Kryeziu, Visar Shehu, Agron Chaushi
Published in: 2022 International Conference on Artificial Intelligence of Things (ICAIoT), 2022-12-29
DOI: 10.1109/ICAIoT57170.2022.10121823 (https://doi.org/10.1109/ICAIoT57170.2022.10121823)
Citations: 1

Abstract

Computational linguistics has seen tremendous growth and has provided users with high-end applications such as automatic translation tools, speech recognition, and speech synthesis. However, such advancements are lacking for low-resource languages. Our research aims to tackle one of these challenges, specifically advancing computational linguistics and natural language processing (NLP) for the Albanian language. To develop accurate NLP tools, one must have a consistent and clean dataset for that language. In this paper we evaluate two well-known text corpora: OSCAR and CCAligned. The results are compared with a dataset that we have collected and curated, which we refer to in this paper as alb_dataset. Various statistical methods have been used to compare and evaluate the datasets. The conclusions of this paper can inform NLP researchers working on the Albanian language before they use either of the text corpora mentioned above.