Assessing Suitable Word Embedding Model for Malay Language through Intrinsic Evaluation

Yeong-Tsann Phua, K. Yew, O. Foong, M. Teow
DOI: 10.1109/ICCI51257.2020.9247707
Published in: 2020 International Conference on Computational Intelligence (ICCI)
Publication date: 2020-10-08
Citations: 1

Abstract

Word embeddings form meaningful vector representations of words in an efficient manner, an essential step in most Natural Language Processing tasks. In this paper, different Malay word embedding models were trained on a Malay text corpus: Word2Vec and fastText, each with both CBOW and Skip-gram architectures, and GloVe. The trained models were evaluated intrinsically on semantic similarity and word analogy tasks. In the experiments, the custom-trained fastText Skip-gram model achieved a Pearson correlation coefficient of 0.5509 on word similarity and 36.80% accuracy on word analogies, outperforming the pre-trained fastText model, which achieved only 0.477 and 22.96%, respectively. The results show that there is still room for improvement in both the pre-processing pipeline and the evaluation datasets.
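The two intrinsic evaluations described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the word vectors and the human "gold" similarity ratings below are hypothetical stand-ins (a real evaluation would use vectors from a trained Word2Vec/fastText/GloVe model and a benchmark dataset), while the scoring itself follows the standard recipes — Pearson correlation between model cosine similarities and human ratings, and the 3CosAdd method for analogies.

```python
import numpy as np

# Hypothetical 3-d vectors standing in for a trained Malay embedding model.
emb = {
    "raja":       np.array([0.90, 0.10, 0.00]),  # king
    "permaisuri": np.array([0.85, 0.10, 0.40]),  # queen
    "lelaki":     np.array([0.10, 0.90, 0.00]),  # man
    "perempuan":  np.array([0.05, 0.90, 0.40]),  # woman
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def analogy(a, b, c, emb):
    """3CosAdd: the word d maximizing cos(d, b - a + c), excluding a, b, c."""
    target = emb[b] - emb[a] + emb[c]
    cands = {w: cosine(v, target) for w, v in emb.items() if w not in (a, b, c)}
    return max(cands, key=cands.get)

# Word similarity: correlate model cosines with (hypothetical) human ratings.
pairs = [("raja", "permaisuri"), ("raja", "lelaki"), ("lelaki", "perempuan")]
gold = [8.5, 4.0, 8.0]
model_scores = [cosine(emb[a], emb[b]) for a, b in pairs]
print("Pearson r:", round(pearson(model_scores, gold), 4))

# Word analogy: lelaki : raja :: perempuan : ?  (expect permaisuri)
print(analogy("lelaki", "raja", "perempuan", emb))
```

The analogy accuracy reported in the paper is simply the fraction of such queries where the top-ranked candidate matches the expected word.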