Automated construction of a software-specific word similarity database

2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). Pub Date: 2014-02-27. DOI: 10.1109/CSMR-WCRE.2014.6747213

Yuan Tian, D. Lo, J. Lawall

Citations: 89

Abstract

Many automated software engineering approaches, including code search, bug report categorization, and duplicate bug report detection, measure the similarity between two documents by analyzing their natural language content. Different words are often used to express the same meaning, so measuring similarity by exact word matching is insufficient. Past studies have shown the need to measure the similarity between pairs of words. To meet this need, the natural language processing community has built WordNet, a manually constructed lexical database that records semantic relations among words and can be used to measure how similar two words are. However, WordNet is a general-purpose resource and often does not contain software-specific words. Moreover, the meanings of words in WordNet often differ from their meanings in a software engineering context. Thus, there is a need for a software-specific, WordNet-like resource that can measure the similarity of words. In this work, we propose an automated approach that builds such a resource, named WordSimSEDB, by leveraging the textual contents of posts in StackOverflow. Our approach measures the similarity of words by computing the similarity of their weighted co-occurrences with three types of words in the textual corpus. We evaluated our approach on a set of software-specific words and compared it with an existing WordNet-based technique (WordNetres) on the task of returning the top-k most similar words. Human judges evaluated the effectiveness of the two techniques. We find that WordNetres returns no result for 55% of the queries. For the remaining queries, WordNetres returns significantly poorer results.
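
The general idea of scoring word similarity from weighted co-occurrences can be illustrated with a minimal sketch. It is not the authors' WordSimSEDB implementation: the abstract does not specify the weighting scheme or the three types of context words, so the distance-based weight, the window size, the use of cosine similarity, and the function names `cooccurrence_vectors`, `cosine`, and `top_k_similar` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(docs, window=2):
    """Build a weighted co-occurrence vector for every word in the corpus.

    Assumption: a neighbor at distance d contributes weight 1/d. The paper's
    actual weighting is not given in the abstract.
    """
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1.0 / abs(i - j)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(word, vectors, k=3):
    """Return up to k (word, score) pairs ranked by similarity to `word`."""
    target = vectors.get(word)
    if not target:
        return []  # unknown word: no context to compare against
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Toy corpus standing in for post text (hypothetical data).
posts = [
    "click the button to submit the form",
    "press the button to submit the form",
]
vectors = cooccurrence_vectors(posts)
nearest = top_k_similar("click", vectors, k=1)
```

In this toy corpus, "click" and "press" appear in identical contexts, so "press" ranks first for the query "click" even though the two words never occur together. A word absent from the corpus yields an empty list, which mirrors the abstract's observation that a lookup resource returns nothing for words it does not contain.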
