Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora

IF 0.5 Q4 COMPUTER SCIENCE, THEORY & METHODS
Rolands Laucis, Gints Jēkabsons
Journal: Applied Computer Systems, 28(3), pp. 132–138
DOI: 10.2478/acss-2021-0016
Published: 2021-12-01 (Journal Article)
Citations: 0

Abstract

Nowadays, natural language processing (NLP) increasingly relies on pre-trained word embeddings for use in various tasks. However, there is little research devoted to Latvian – a language that is much more morphologically complex than English. In this study, several experiments were carried out in three NLP tasks on four different methods of creating word embeddings: word2vec, fastText, Structured Skip-Gram and ngram2vec. The obtained results can serve as a baseline for future research on the Latvian language in NLP. The main conclusions are the following: First, in the part-of-speech tagging task, using a training corpus 46 times smaller than in a previous study, the accuracy was 91.4 % (versus 98.3 % in the previous study). Second, fastText demonstrated the overall best effectiveness. Third, the best results for all methods were observed for embeddings with a dimension size of 200. Finally, word lemmatization generally did not improve results.
Source journal: Applied Computer Systems (Computer Science, Theory & Methods)
Self-citation rate: 10.00%
Articles per year: 9
Review time: 30 weeks