Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation

IF 1.6 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal Pub Date : 2023-11-01 DOI:10.14201/adcaij.31084

Muhammad Farhat Ullah, Ali Saeed, Naveed Hussain

Abstract

The primary objective of word sense disambiguation (WSD) is to build systems that can automatically identify the intended meaning (sense) of an ambiguous word in a sentence. WSD can help with a range of NLP and HCI tasks. Researchers have explored a wide variety of methods for resolving sense ambiguity, but most of that work has focused on English and a few other widely studied languages. Urdu, with more than 300 million users and a large amount of electronic text available on the web, remains unexplored. In recent years, word embedding methods have proven extremely successful across a variety of natural language processing tasks. This study applies, evaluates, and compares a variety of word embedding approaches on Urdu WSD (both Lexical Sample and All-Words), covering pre-trained models (Word2Vec, GloVe, and FastText) as well as custom-trained ones (Word2Vec, GloVe, and FastText trained on the Ur-Mono corpus). Two benchmark corpora are used for evaluation: (1) the UAW-WSD-18 corpus and (2) the ULS-WSD-18 corpus. For the Urdu All-Words WSD task, the top results (Accuracy=60.07 and F1=0.45) were achieved using pre-trained FastText. For the Lexical Sample WSD task, the results (Accuracy=70.93 and F1=0.60) were achieved using the custom-trained GloVe word embedding method.
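The abstract does not say how the embeddings are turned into sense predictions, so the following is only a minimal sketch of one common embedding-based approach, not the authors' method: average the vectors of the context words, average the vectors of each candidate sense's gloss words, and pick the sense with the highest cosine similarity. The toy English vectors, the `SENSES` inventory, and every function name are hypothetical stand-ins for real Word2Vec, GloVe, or FastText vectors and an Urdu sense inventory. Accuracy is computed on a 0-100 scale, which is an assumption about how the reported figures are scaled.

```python
import math

# Toy 3-dimensional vectors standing in for real pre-trained or
# custom-trained embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "river": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "shore": [0.9, 0.0, 0.1],
    "money": [0.1, 0.9, 0.1],
    "loan": [0.0, 0.9, 0.2],
    "deposit": [0.2, 0.8, 0.0],
}

# Hypothetical sense inventory: each sense is described by a few gloss words.
SENSES = {
    "bank": {
        "bank/river_side": ["river", "shore"],
        "bank/financial": ["money", "loan"],
    }
}


def mean_vector(words):
    """Average the vectors of in-vocabulary words; None if none are known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(target, context_words):
    """Return the sense whose gloss vector is closest to the context vector."""
    context = mean_vector(context_words)
    if context is None:
        return None
    scores = {}
    for sense, gloss in SENSES[target].items():
        gloss_vector = mean_vector(gloss)
        if gloss_vector is not None:
            scores[sense] = cosine(context, gloss_vector)
    return max(scores, key=scores.get) if scores else None


def accuracy(predicted, gold):
    """Percentage of instances whose predicted sense matches the gold sense."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


predictions = [
    disambiguate("bank", ["water", "shore"]),
    disambiguate("bank", ["deposit", "loan"]),
]
gold = ["bank/river_side", "bank/financial"]
print(predictions, accuracy(predictions, gold))
```

Under this setup, swapping pre-trained vectors for custom-trained ones only changes the contents of `EMBEDDINGS`; the selection and scoring code stays the same, which is what makes a like-for-like comparison of embedding models possible.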
