Investigation of IR based topic models on issue tracking systems to infer software-specific semantic related term pairs

2017 Tenth International Conference on Contemporary Computing (IC3), Pub Date: 2017-08-01, DOI: 10.1109/IC3.2017.8284329

D. Correa, A. Sureka, Sangeeta Lal

Abstract

Software maintenance is a core component of any software development life cycle. Contemporary software systems contain voluminous and complex information stored in software repositories. Software maintenance professionals spend a significant amount of time searching and exploring these repositories for common maintenance tasks such as bug fixing, feature enhancement, code refactoring, and reengineering. Tools and methods that facilitate search in software repositories can therefore give maintenance professionals faster access to the information they need and increase their productivity. A domain-specific lexical resource is an important tool for bridging the semantic gap between an information need and the search query that expresses it. In this work, we investigate the use of information retrieval (IR) based topic models, such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), to infer semantically related terms for a software-context-specific lexical resource. We perform our experiments on the issue tracker of Google Chromium, a widely popular open-source browser, which contains 134,000+ bug reports. We divide our study into two parts. (1) In the first part, we apply our IR models to the free-form natural language text present in defect tracking systems. We perform a qualitative analysis of the output and uncover semantically related terms in the Google Chromium software context. We observe that we are able to infer semantically similar term pairs in four different contexts: the English language, software, Google Chromium, and code details. (2) In the second part, we use the semantically related terms inferred by the IR models to facilitate the software maintenance task of duplicate bug report detection. Our results demonstrate that applying IR based topic models to defect tracking systems to automatically infer semantically related terms can help build a software domain-specific lexical resource and reduce the vocabulary gap.
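The two-part idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: instead of LSI or LDA it uses a simpler stand-in, plain distributional similarity, where two terms count as related when they co-occur with the same other terms. The bug-report summaries, the stopword list, the 0.9 threshold, and all function names are assumptions invented for illustration and are not taken from the Chromium data.

```python
import math
from collections import Counter

# Toy bug-report summaries, invented for illustration (not Chromium data).
REPORTS = [
    "browser crash when opening tab",
    "browser freeze when opening tab",
    "renderer crash while loading page",
    "renderer freeze while loading page",
    "bookmark sync fails after login",
    "bookmark import fails after login",
]
STOPWORDS = {"when", "while", "after"}  # hypothetical, hand-picked


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def context_vectors(docs):
    """For each term, count the other terms that share a report with it."""
    contexts = {}
    for doc in docs:
        terms = tokenize(doc)
        for term in terms:
            ctx = contexts.setdefault(term, Counter())
            ctx.update(t for t in terms if t != term)
    return contexts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_terms(term, contexts, threshold=0.9):
    """Terms whose context vector is close to that of `term`."""
    base = contexts.get(term, Counter())
    return {other for other, ctx in contexts.items()
            if other != term and cosine(base, ctx) >= threshold}


def expanded(text, contexts):
    """Token set of a report, widened with its related terms."""
    terms = set(tokenize(text))
    for term in list(terms):
        terms |= related_terms(term, contexts)
    return terms


def jaccard(a, b):
    """Set overlap, used here as a crude duplicate-report score."""
    return len(a & b) / len(a | b) if a | b else 0.0


contexts = context_vectors(REPORTS)
print(related_terms("crash", contexts))  # → {'freeze'}
```

On this toy corpus, "crash" and "freeze" never appear in the same report, yet they are paired because they share identical contexts. Widening two reports with `expanded` before comparing them with `jaccard` then raises the overlap between "browser crash when opening tab" and "browser freeze when opening tab", which is the kind of vocabulary-gap reduction the second part of the study targets. A topic model such as LSI would replace the raw context counts with a low-rank representation of the term-document matrix.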
