Title: Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain
Authors: S. Mishra, Arpit Sharma
DOI: 10.1145/3452383.3452401
Venue: 14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)
Published: 2021-02-25
Citation count: 6
Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain
Word embeddings allow words used in similar contexts to have similar representations. Due to this property, word embeddings are widely used as an input feature for classification, clustering, and sentiment analysis of natural language (NL) textual data produced during the software development process. Since the accuracy of these machine learning (ML) tasks depends on how well the word embedding vectors capture contextual information, it is important to train these models on a software engineering (SE) specific text corpus. This paper proposes a pre-trained word embeddings model for SE which captures and reflects the domain-specific meanings of SE-related words. We create a domain-specific text corpus by crawling the SE category on Wikipedia and use it to train the word embeddings model. We show that our model can outperform the state-of-the-art word embeddings model trained on Google News in terms of its representational power. More specifically, our model is able to express the SE-specific meaning of polysemous words and to recognize SE-specific technical words. Additionally, we show that for almost all the words related to the fundamental SE activities, our model is either comparable to or better than the SE-specific model trained over Stack Overflow (SO) posts.
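The core idea behind the abstract's opening claim — words that occur in similar contexts end up with similar vectors — can be sketched with a minimal co-occurrence model. This is an illustrative toy only, not the paper's method (the paper trains a full word embeddings model on a Wikipedia-crawled SE corpus); the tiny corpus and window size below are assumptions made purely for demonstration:

```python
from collections import Counter
import math

# Toy corpus: two SE-flavored sentences sharing a context, plus one unrelated one.
corpus = [
    "the developer fixed a bug in the source code".split(),
    "the engineer fixed a defect in the source code".split(),
    "the cat sat on the mat".split(),
]

def cooc_vector(word, window=2):
    """Count context words within +/-window positions of each occurrence of `word`."""
    counts = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w != word:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[sent[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

dev = cooc_vector("developer")
eng = cooc_vector("engineer")
cat = cooc_vector("cat")

# "developer" and "engineer" share their contexts, so their vectors align;
# "cat" appears in a different context, so its similarity is lower.
print(cosine(dev, eng))  # high
print(cosine(dev, cat))  # lower
```

Real embedding models like word2vec learn dense low-dimensional vectors rather than raw counts, but the same distributional signal — shared contexts — drives the similarity, which is why training on an SE-specific corpus changes which words end up near each other.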