Evaluation and Verification of NLP Datasets for the Albanian Language
Labehat Kryeziu, Visar Shehu, Agron Chaushi
2022 International Conference on Artificial Intelligence of Things (ICAIoT)
Publication date: 2022-12-29
DOI: 10.1109/ICAIoT57170.2022.10121823 (https://doi.org/10.1109/ICAIoT57170.2022.10121823)
Citation count: 1
Abstract
Computational linguistics has grown tremendously and now provides users with high-end applications such as automatic translation, speech recognition, and speech synthesis. However, such advances are lacking for low-resource languages. Our research tackles one of these challenges: advancing computational linguistics and natural language processing (NLP) for the Albanian language. Developing accurate NLP tools requires a consistent and clean dataset for the language. In this paper we evaluate two well-known text corpora, OSCAR and CCAligned, and compare the results with a dataset that we have collected and curated, which we refer to in this paper as alb_dataset. Various statistical methods are used to compare and evaluate the datasets. NLP researchers working on Albanian can consult the conclusions of this paper before adopting either of the text corpora mentioned above.