Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain

S. Mishra, Arpit Sharma
{"title":"抓取维基百科页面训练用于软件工程领域的词嵌入模型","authors":"S. Mishra, Arpit Sharma","doi":"10.1145/3452383.3452401","DOIUrl":null,"url":null,"abstract":"Word embeddings allow words used in similar contexts to have similar meanings. Due to this property word embeddings are widely used as an input feature for classification, clustering and sentiment analysis of natural language (NL) textual data produced during the software development process. Since the accuracy of these machine learning (ML) tasks depends on how well the word embeddings vectors capture the contextual information, it is important to train these models on software engineering (SE) specific text corpus. This paper proposes a pre-trained word embeddings model for SE which captures and reflects the domain-specific meanings of SE related words. We create a domain-specific text corpus by crawling the SE category on Wikipedia and use it for training the word embeddings model. We show that our model can outperform the state-of-the-art word embeddings model trained on Google news in terms of its representational power. More specifically, our model is able to express the SE specific meaning of polysemous words and recognize SE-specific technical words. Additionally, we also show that for almost all the words related to the fundamental SE activities, our model is either comparable or better than the SE-specific model trained over Stack Overflow (SO) posts.","PeriodicalId":378352,"journal":{"name":"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain\",\"authors\":\"S. Mishra, Arpit Sharma\",\"doi\":\"10.1145/3452383.3452401\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Word embeddings allow words used in similar contexts to have similar meanings. Due to this property word embeddings are widely used as an input feature for classification, clustering and sentiment analysis of natural language (NL) textual data produced during the software development process. Since the accuracy of these machine learning (ML) tasks depends on how well the word embeddings vectors capture the contextual information, it is important to train these models on software engineering (SE) specific text corpus. This paper proposes a pre-trained word embeddings model for SE which captures and reflects the domain-specific meanings of SE related words. We create a domain-specific text corpus by crawling the SE category on Wikipedia and use it for training the word embeddings model. We show that our model can outperform the state-of-the-art word embeddings model trained on Google news in terms of its representational power. More specifically, our model is able to express the SE specific meaning of polysemous words and recognize SE-specific technical words. 
Additionally, we also show that for almost all the words related to the fundamental SE activities, our model is either comparable or better than the SE-specific model trained over Stack Overflow (SO) posts.\",\"PeriodicalId\":378352,\"journal\":{\"name\":\"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3452383.3452401\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3452383.3452401","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Word embeddings allow words used in similar contexts to have similar representations. Because of this property, word embeddings are widely used as input features for classification, clustering, and sentiment analysis of natural language (NL) textual data produced during the software development process. Since the accuracy of these machine learning (ML) tasks depends on how well the word embedding vectors capture contextual information, it is important to train these models on a software engineering (SE)-specific text corpus. This paper proposes a pre-trained word embeddings model for SE that captures and reflects the domain-specific meanings of SE-related words. We create a domain-specific text corpus by crawling the SE category on Wikipedia and use it to train the word embeddings model. We show that our model outperforms the state-of-the-art word embeddings model trained on Google News in terms of representational power. More specifically, our model is able to express the SE-specific meanings of polysemous words and to recognize SE-specific technical words. Additionally, we show that for almost all words related to the fundamental SE activities, our model is comparable to or better than the SE-specific model trained on Stack Overflow (SO) posts.
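
To make the pipeline concrete, the sketch below shows one plausible way to reproduce the two steps the abstract describes: collecting a corpus from pages in Wikipedia's "Software engineering" category through the public MediaWiki API, and training a word2vec model on it with gensim (>= 4.0). This is not the authors' released code: the API calls are standard, but the single-level category crawl, the preprocessing, the skip-gram hyperparameters, and the "python" probe word are all illustrative assumptions rather than the paper's exact setup.

import requests
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

API = "https://en.wikipedia.org/w/api.php"
# Wikimedia asks for a descriptive User-Agent; throttle requests in real use.
HEADERS = {"User-Agent": "se-embeddings-sketch/0.1 (research demo)"}

def category_titles(category="Category:Software engineering"):
    """Yield titles of article pages (namespace 0) in one Wikipedia category."""
    params = {
        "action": "query", "list": "categorymembers", "cmtitle": category,
        "cmnamespace": 0, "cmlimit": 500, "format": "json",
    }
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the pagination cursor

def page_text(title):
    """Fetch the plain-text extract of one page (one page per request)."""
    params = {
        "action": "query", "prop": "extracts", "explaintext": 1,
        "titles": title, "format": "json",
    }
    pages = requests.get(API, params=params, headers=HEADERS,
                         timeout=30).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

# Build the corpus: one tokenized "sentence" per line of page text.
sentences = []
for title in category_titles():
    for line in page_text(title).splitlines():
        tokens = simple_preprocess(line)
        if tokens:
            sentences.append(tokens)

# Train a skip-gram word2vec model; dimensionality, window, and min_count
# here are assumed values, not the paper's reported configuration.
model = Word2Vec(sentences, vector_size=200, window=5,
                 min_count=5, sg=1, workers=4)

# A polysemous-word probe: in an SE corpus, the neighbors of "python"
# should be programming-related rather than zoological.
print(model.wv.most_similar("python", topn=10))

For the kind of comparison the paper reports, the resulting neighbors can be contrasted with those from the pre-trained Google News vectors, loaded via gensim's KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True); on general news text, neighbors of "python" tend toward the animal sense rather than the programming language.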