Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation

IF 1.6 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal Pub Date : 2023-11-01 DOI:10.14201/adcaij.31084

Muhammad Farhat Ullah, Ali Saeed, Naveed Hussain



Abstract

The primary objective of word sense disambiguation (WSD) is to build systems that can automatically identify the intended meaning (sense) of an ambiguous word in a sentence. WSD can help with a range of natural language processing (NLP) and human-computer interaction (HCI) problems. Researchers have explored a wide variety of methods for resolving sense ambiguity, but most of that work has focused on English and a few other widely studied languages. Urdu, with more than 300 million users and a large amount of electronic text available on the web, remains unexplored. In recent years, word embedding methods have proven extremely successful across a variety of NLP tasks. This study applies, evaluates, and compares a variety of word embedding approaches for Urdu WSD (both the Lexical Sample and All-Words tasks), including pre-trained models (Word2Vec, GloVe, and FastText) and custom-trained models (Word2Vec, GloVe, and FastText trained on the Ur-Mono corpus). Two benchmark corpora are used for evaluation: (1) the UAW-WSD-18 corpus and (2) the ULS-WSD-18 corpus. For the Urdu All-Words WSD task, the best results (Accuracy=60.07 and F1=0.45) were achieved using pre-trained FastText. For the Lexical Sample task, the best results (Accuracy=70.93 and F1=0.60) were achieved using the custom-trained GloVe word embedding method.
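
The abstract does not say how the embeddings are turned into sense predictions. As an illustration only, the sketch below assumes one common approach: average the vectors of the context words, average the vectors of a few signature words per candidate sense, and pick the sense with the highest cosine similarity. The toy 3-dimensional vectors, the English example words, and the names `EMBEDDINGS`, `SENSES`, and `disambiguate` are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy vectors standing in for real pre-trained or
# custom-trained embeddings (values invented for illustration).
EMBEDDINGS = {
    "river":   [0.9, 0.1, 0.0],
    "shore":   [0.9, 0.0, 0.1],
    "water":   [0.8, 0.2, 0.1],
    "money":   [0.1, 0.9, 0.1],
    "loan":    [0.0, 0.8, 0.2],
    "deposit": [0.2, 0.9, 0.0],
}

# Hypothetical sense inventory for the ambiguous word "bank":
# each sense is described by a few signature words.
SENSES = {
    "bank/finance": ["money", "loan"],
    "bank/river":   ["river", "shore"],
}


def mean_vector(words, embeddings):
    """Average the vectors of all words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(context_words, senses, embeddings):
    """Return the sense whose signature is closest to the context."""
    context_vec = mean_vector(context_words, embeddings)
    if context_vec is None:
        return None  # no context word is in the vocabulary
    best_sense, best_score = None, -2.0  # cosine is always >= -1
    for sense, signature in senses.items():
        sense_vec = mean_vector(signature, embeddings)
        if sense_vec is None:
            continue
        score = cosine(context_vec, sense_vec)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


finance_prediction = disambiguate(["deposit", "money"], SENSES, EMBEDDINGS)
river_prediction = disambiguate(["water"], SENSES, EMBEDDINGS)
```

Under this assumed setup, comparing pre-trained and custom-trained models would amount to swapping the `EMBEDDINGS` table while keeping the disambiguation procedure and the evaluation corpus fixed.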
