Evaluation and Verification of NLP Datasets for the Albanian Language

2022 International Conference on Artificial Intelligence of Things (ICAIoT). Pub Date: 2022-12-29. DOI: 10.1109/ICAIoT57170.2022.10121823

Labehat Kryeziu, Visar Shehu, Agron Chaushi

Citations: 1

Abstract

Computational Linguistics has seen tremendous growth and has given users high-end applications such as automatic translation tools, speech recognition, and speech synthesis. However, such advancements are lacking for low-resource languages. Our research aims to tackle one of these challenges, specifically advancing Computational Linguistics and Natural Language Processing (NLP) for the Albanian language. To develop accurate NLP tools, one must have a consistent and clean dataset for that language. In this paper we evaluate two well-known text corpora: OSCAR and CCAligned. The results are compared with a dataset that we have collected and curated, which we refer to in this paper as alb_dataset. Various statistical methods have been used to compare and evaluate the datasets. The conclusions of this paper can be used by NLP researchers working on the Albanian language before they adopt one of the text corpora mentioned above.
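The abstract does not say which statistics were used, so the following is only an illustrative sketch of the kind of descriptive comparison one might run over several corpora. The function name `corpus_stats`, the chosen metrics (line count, token count, vocabulary size, type-token ratio, average tokens per line), and the toy sentences are all assumptions made for this example; they are not taken from the paper.

```python
import re
import unicodedata
from collections import Counter

# Runs of Unicode letters only (digits, underscores, punctuation excluded).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def corpus_stats(lines):
    """Return simple descriptive statistics for an iterable of text lines.

    Hypothetical helper for illustration; not the authors' method.
    """
    counts = Counter()
    n_lines = 0
    for line in lines:
        # NFC keeps letters such as "ë" as a single code point,
        # so the regex does not split words at combining marks.
        line = unicodedata.normalize("NFC", line).strip()
        if not line:
            continue  # skip empty lines
        n_lines += 1
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))

    n_tokens = sum(counts.values())
    n_types = len(counts)
    return {
        "lines": n_lines,
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "avg_tokens_per_line": n_tokens / n_lines if n_lines else 0.0,
    }


if __name__ == "__main__":
    # Toy stand-ins for real corpora; actual use would stream lines from files.
    corpora = {
        "corpus_a": ["Tirana është kryeqyteti i Shqipërisë.", "Tirana është qytet i madh."],
        "corpus_b": ["click here click here", "click here click here"],
    }
    for name, lines in corpora.items():
        print(name, corpus_stats(lines))
```

Running the same function over each corpus gives figures that can be compared side by side; a very low type-token ratio, for instance, can hint at repetitive or boilerplate-heavy text, though any such reading would need checking against the data itself.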



