Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain

S. Mishra, Arpit Sharma
{"title":"抓取维基百科页面训练用于软件工程领域的词嵌入模型","authors":"S. Mishra, Arpit Sharma","doi":"10.1145/3452383.3452401","DOIUrl":null,"url":null,"abstract":"Word embeddings allow words used in similar contexts to have similar meanings. Due to this property word embeddings are widely used as an input feature for classification, clustering and sentiment analysis of natural language (NL) textual data produced during the software development process. Since the accuracy of these machine learning (ML) tasks depends on how well the word embeddings vectors capture the contextual information, it is important to train these models on software engineering (SE) specific text corpus. This paper proposes a pre-trained word embeddings model for SE which captures and reflects the domain-specific meanings of SE related words. We create a domain-specific text corpus by crawling the SE category on Wikipedia and use it for training the word embeddings model. We show that our model can outperform the state-of-the-art word embeddings model trained on Google news in terms of its representational power. More specifically, our model is able to express the SE specific meaning of polysemous words and recognize SE-specific technical words. Additionally, we also show that for almost all the words related to the fundamental SE activities, our model is either comparable or better than the SE-specific model trained over Stack Overflow (SO) posts.","PeriodicalId":378352,"journal":{"name":"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain\",\"authors\":\"S. Mishra, Arpit Sharma\",\"doi\":\"10.1145/3452383.3452401\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Word embeddings allow words used in similar contexts to have similar meanings. Due to this property word embeddings are widely used as an input feature for classification, clustering and sentiment analysis of natural language (NL) textual data produced during the software development process. Since the accuracy of these machine learning (ML) tasks depends on how well the word embeddings vectors capture the contextual information, it is important to train these models on software engineering (SE) specific text corpus. This paper proposes a pre-trained word embeddings model for SE which captures and reflects the domain-specific meanings of SE related words. We create a domain-specific text corpus by crawling the SE category on Wikipedia and use it for training the word embeddings model. We show that our model can outperform the state-of-the-art word embeddings model trained on Google news in terms of its representational power. More specifically, our model is able to express the SE specific meaning of polysemous words and recognize SE-specific technical words. 
Additionally, we also show that for almost all the words related to the fundamental SE activities, our model is either comparable or better than the SE-specific model trained over Stack Overflow (SO) posts.\",\"PeriodicalId\":378352,\"journal\":{\"name\":\"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3452383.3452401\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3452383.3452401","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Word embeddings allow words used in similar contexts to have similar representations. Because of this property, word embeddings are widely used as input features for classification, clustering, and sentiment analysis of natural language (NL) textual data produced during the software development process. Since the accuracy of these machine learning (ML) tasks depends on how well the word embedding vectors capture contextual information, it is important to train these models on a software engineering (SE)-specific text corpus. This paper proposes a pre-trained word embeddings model for SE that captures and reflects the domain-specific meanings of SE-related words. We create a domain-specific text corpus by crawling the SE category on Wikipedia and use it to train the word embeddings model. We show that our model outperforms the state-of-the-art word embeddings model trained on Google News in terms of representational power. More specifically, our model is able to express the SE-specific meanings of polysemous words and to recognize SE-specific technical words. Additionally, we show that for almost all words related to the fundamental SE activities, our model is comparable to or better than the SE-specific model trained on Stack Overflow (SO) posts.
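
To make the pipeline concrete, the sketch below shows one plausible way to reproduce the two steps the abstract describes: collecting a corpus from pages in Wikipedia's "Software engineering" category through the public MediaWiki API, and training a word2vec model on it with gensim (>= 4.0). This is not the authors' released code: the API calls are standard, but the single-level category crawl, the preprocessing, the skip-gram hyperparameters, and the "python" probe word are all illustrative assumptions rather than the paper's exact setup.

import requests
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

API = "https://en.wikipedia.org/w/api.php"
# Wikimedia asks for a descriptive User-Agent; throttle requests in real use.
HEADERS = {"User-Agent": "se-embeddings-sketch/0.1 (research demo)"}

def category_titles(category="Category:Software engineering"):
    """Yield titles of article pages (namespace 0) in one Wikipedia category."""
    params = {
        "action": "query", "list": "categorymembers", "cmtitle": category,
        "cmnamespace": 0, "cmlimit": 500, "format": "json",
    }
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the pagination cursor

def page_text(title):
    """Fetch the plain-text extract of one page (one page per request)."""
    params = {
        "action": "query", "prop": "extracts", "explaintext": 1,
        "titles": title, "format": "json",
    }
    pages = requests.get(API, params=params, headers=HEADERS,
                         timeout=30).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

# Build the corpus: one tokenized "sentence" per line of page text.
sentences = []
for title in category_titles():
    for line in page_text(title).splitlines():
        tokens = simple_preprocess(line)
        if tokens:
            sentences.append(tokens)

# Train a skip-gram word2vec model; dimensionality, window, and min_count
# here are assumed values, not the paper's reported configuration.
model = Word2Vec(sentences, vector_size=200, window=5,
                 min_count=5, sg=1, workers=4)

# A polysemous-word probe: in an SE corpus, the neighbors of "python"
# should be programming-related rather than zoological.
print(model.wv.most_similar("python", topn=10))

For the kind of comparison the paper reports, the resulting neighbors can be contrasted with those from the pre-trained Google News vectors, loaded via gensim's KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True); on general news text, neighbors of "python" tend toward the animal sense rather than the programming language.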