Training and evaluation of vector models for Galician

IF 1.8 3区计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Language Resources and Evaluation Pub Date : 2024-06-04 DOI:10.1007/s10579-024-09740-0

Marcos Garcia

{"title":"Training and evaluation of vector models for Galician","authors":"Marcos Garcia","doi":"10.1007/s10579-024-09740-0","DOIUrl":null,"url":null,"abstract":"This paper presents a large and systematic assessment of distributional models for Galician. To this end, we have first trained and evaluated static word embeddings (e.g., word2vec, GloVe), and then compared their performance with that of current contextualised representations generated by neural language models. First, we have compiled and processed a large corpus for Galician, and created four datasets for word analogies and concept categorisation based on standard resources for other languages. Using the aforementioned corpus, we have trained 760 static vector space models which vary in their input representations (e.g., adjacency-based versus dependency-based approaches), learning algorithms, size of the surrounding contexts, and in the number of vector dimensions. These models have been evaluated both intrinsically, using the newly created datasets, and on extrinsic tasks, namely on POS-tagging, dependency parsing, and named entity recognition. The results provide new insights into the performance of different vector models in Galician, and about the impact of several training parameters on each task. In general, fastText embeddings are the static representations with the best performance in the intrinsic evaluations and in named entity recognition, while syntax-based embeddings achieve the highest results in POS-tagging and dependency parsing, indicating that there is no significant correlation between the performance in the intrinsic and extrinsic tasks. Finally, we have compared the performance of static vector representations with that of BERT-based word embeddings, whose fine-tuning obtains the best performance on named entity recognition. This comparison provides a comprehensive state-of-the-art of current models in Galician, and releases new transformer-based models for NER. All the resources used in this research are freely available to the community, and the best models have been incorporated into SemantiGal, an online tool to explore vector representations for Galician.","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"3 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09740-0","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

This paper presents a large and systematic assessment of distributional models for Galician. To this end, we have first trained and evaluated static word embeddings (e.g., word2vec, GloVe), and then compared their performance with that of current contextualised representations generated by neural language models. First, we have compiled and processed a large corpus for Galician, and created four datasets for word analogies and concept categorisation based on standard resources for other languages. Using the aforementioned corpus, we have trained 760 static vector space models which vary in their input representations (e.g., adjacency-based versus dependency-based approaches), learning algorithms, size of the surrounding contexts, and in the number of vector dimensions. These models have been evaluated both intrinsically, using the newly created datasets, and on extrinsic tasks, namely on POS-tagging, dependency parsing, and named entity recognition. The results provide new insights into the performance of different vector models in Galician, and about the impact of several training parameters on each task. In general, fastText embeddings are the static representations with the best performance in the intrinsic evaluations and in named entity recognition, while syntax-based embeddings achieve the highest results in POS-tagging and dependency parsing, indicating that there is no significant correlation between the performance in the intrinsic and extrinsic tasks. Finally, we have compared the performance of static vector representations with that of BERT-based word embeddings, whose fine-tuning obtains the best performance on named entity recognition. This comparison provides a comprehensive state-of-the-art of current models in Galician, and releases new transformer-based models for NER. All the resources used in this research are freely available to the community, and the best models have been incorporated into SemantiGal, an online tool to explore vector representations for Galician.

Abstract Image

查看原文本刊更多论文

加利西亚语矢量模型的训练和评估

本文对加利西亚语的分布模型进行了大规模的系统评估。为此，我们首先对静态词嵌入（如 word2vec、GloVe）进行了训练和评估，然后将其性能与当前由神经语言模型生成的上下文表征进行了比较。首先，我们为加利西亚语编译和处理了一个大型语料库，并根据其他语言的标准资源创建了四个用于词语类比和概念分类的数据集。利用上述语料库，我们训练了 760 个静态向量空间模型，这些模型在输入表征（如基于邻接的方法和基于依赖的方法）、学习算法、周围上下文的大小以及向量维数方面各不相同。我们利用新创建的数据集对这些模型进行了内在评估，并对外在任务（即 POS 标记、依赖关系解析和命名实体识别）进行了评估。评估结果为了解不同向量模型在加利西亚语中的性能以及多个训练参数对各项任务的影响提供了新的视角。总的来说，在内在评估和命名实体识别中，快速文本嵌入是性能最好的静态表示，而在 POS 标记和依赖关系解析中，基于语法的嵌入取得了最高的成绩，这表明内在任务和外在任务的性能之间没有明显的相关性。最后，我们比较了静态向量表示法和基于 BERT 的词嵌入法的性能，后者的微调在命名实体识别中取得了最佳性能。通过比较，我们全面了解了目前加利西亚语模型的最新情况，并发布了新的基于转换器的 NER 模型。这项研究中使用的所有资源都免费向社区开放，最佳模型已被纳入SemantiGal，这是一个探索加利西亚语向量表示的在线工具。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Language Resources and Evaluation 工程技术-计算机：跨学科应用

CiteScore

6.50

自引率

3.70%

发文量

审稿时长

>12 weeks

期刊介绍： Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.