Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation

IF 1.7 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Muhammad Farhat Ullah, Ali Saeed, Naveed Hussain
{"title":"Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation","authors":"Muhammad Farhat Ullah, Ali Saeed, Naveed Hussain","doi":"10.14201/adcaij.31084","DOIUrl":null,"url":null,"abstract":"The prime objective of word sense disambiguation (WSD) is to develop such machines that can automatically recognize the actual meaning (sense) of ambiguous words in a sentence. WSD can improve various NLP and HCI challenges. Researchers explored a wide variety of methods to resolve this issue of sense ambiguity. However, majorly, their focus was on English and some other well-reputed languages. Urdu with more than 300 million users and a large amount of electronic text available on the web is still unexplored. In recent years, for a variety of Natural Language Processing tasks, word embedding methods have proven extremely successful. This study evaluates, compares, and applies a variety of word embedding approaches to Urdu Word embedding (both Lexical Sample and All-Words), including pre-trained (Word2Vec, Glove, and FastText) as well as custom-trained (Word2Vec, Glove, and FastText trained on the Ur-Mono corpus). Two benchmark corpora are used for the evaluation in this study: (1) the UAW-WSD-18 corpus and (2) the ULS-WSD-18 corpus. For Urdu All-Words WSD tasks, top results have been achieved (Accuracy=60.07 and F1=0.45) using pre-trained FastText. For the Lexical Sample, WSD has been achieved (Accuracy=70.93 and F1=0.60) using custom-trained GloVe word embedding method.","PeriodicalId":42597,"journal":{"name":"ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal","volume":null,"pages":null},"PeriodicalIF":1.7000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14201/adcaij.31084","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

The prime objective of word sense disambiguation (WSD) is to develop such machines that can automatically recognize the actual meaning (sense) of ambiguous words in a sentence. WSD can improve various NLP and HCI challenges. Researchers explored a wide variety of methods to resolve this issue of sense ambiguity. However, majorly, their focus was on English and some other well-reputed languages. Urdu with more than 300 million users and a large amount of electronic text available on the web is still unexplored. In recent years, for a variety of Natural Language Processing tasks, word embedding methods have proven extremely successful. This study evaluates, compares, and applies a variety of word embedding approaches to Urdu Word embedding (both Lexical Sample and All-Words), including pre-trained (Word2Vec, Glove, and FastText) as well as custom-trained (Word2Vec, Glove, and FastText trained on the Ur-Mono corpus). Two benchmark corpora are used for the evaluation in this study: (1) the UAW-WSD-18 corpus and (2) the ULS-WSD-18 corpus. For Urdu All-Words WSD tasks, top results have been achieved (Accuracy=60.07 and F1=0.45) using pre-trained FastText. For the Lexical Sample, WSD has been achieved (Accuracy=70.93 and F1=0.60) using custom-trained GloVe word embedding method.
语义消歧的预训练词嵌入模型与自定义词嵌入模型的比较
词义消歧(WSD)的主要目标是开发能够自动识别句子中歧义词的实际意义(意义)的机器。水务署可以改善各种NLP和HCI挑战。研究者们探索了各种各样的方法来解决这一问题。然而,他们主要关注的是英语和其他一些著名的语言。乌尔都语有超过3亿的用户,网络上有大量的电子文本,但乌尔都语仍未开发。近年来,对于各种自然语言处理任务,词嵌入方法已经被证明是非常成功的。本研究评估、比较并应用了多种乌尔都语词嵌入方法(包括Lexical Sample和All-Words),包括预训练(Word2Vec、Glove和FastText)和自定义训练(在Ur-Mono语料库上训练的Word2Vec、Glove和FastText)。本研究使用两个基准语料库进行评价:(1)UAW-WSD-18语料库和(2)ULS-WSD-18语料库。对于乌尔都语全词WSD任务,使用预训练的FastText获得了最佳结果(准确率=60.07,F1=0.45)。对于Lexical Sample,使用定制训练的GloVe词嵌入方法实现了WSD(准确率=70.93,F1=0.60)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
1.40
自引率
0.00%
发文量
22
审稿时长
4 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信