A Word Embedding Model for Fault Localization using Bug and Software Change Repositories

Aqib Rehman
DOI: https://doi.org/10.33897/fujeas.v1i1.201
Journal: Iranian Journal of Botany
Publication date: 2020-07-22
Publication type: Journal Article

Abstract

Software deployed in a real-world environment will inevitably exhibit some undesirable behavior. Developers therefore need to provide maintenance facilities so that the bugs causing this behavior can be fixed. Before a bug can be fixed, however, the suspicious part of the code must be identified, which is usually done through fault localization, either manually or automatically. Several fault localization techniques exist in the literature. Most of them are static techniques, because these do not depend on a specific programming language, can work on software that is not yet fully developed, and offer some other benefits. These techniques are largely based on lexical matching of terms, which leads to term mismatch and to a large precision value because of the limited vocabulary of a programming language; some techniques do consider semantics, but localizing faults this way is computationally expensive. In this paper we propose a fault localization technique based on the machine learning concept of word embedding. Our approach looks at the relatedness between bug terms and source code artifacts. We mined bug repositories and software change repositories and trained the word embedding model on the mined data. When a new bug arrives, the model is searched for the cluster of related bugs, and the files that were used to fix those bugs are retrieved from the software change repositories. We compared our approach with the most recent techniques, proposed in 2018, Pointwise Mutual Information (PMI) and Normalized Google Distance (NGD), which consider context, and also with the existing lexical technique Vector Space Model (VSM) and the semantic method Latent Semantic Indexing (LSI). We used the benchmark dataset “MoreBugs”, which has been widely used in this domain. The results show that our approach outperforms the other techniques.
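
The retrieval idea described in the abstract (embed bug reports, find the historical bugs closest to a new report, and return the files changed to fix them) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract does not say which embedding model or parameters were used, so simple co-occurrence counts stand in here for a trained word embedding, and HISTORY, train_embeddings, embed_report, cosine, and rank_files are hypothetical names operating on made-up data.

```python
from math import sqrt

# Hypothetical mined data: (bug report text, files changed by the fixing commit).
HISTORY = [
    ("null pointer exception when saving user profile",
     ["src/user/ProfileStore.java"]),
    ("crash on save profile with empty name",
     ["src/user/ProfileStore.java", "src/user/Validator.java"]),
    ("login page timeout after password reset",
     ["src/auth/LoginController.java"]),
    ("password reset email not sent",
     ["src/auth/MailService.java"]),
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train_embeddings(reports):
    """Count-based stand-in for a trained embedding model (assumption):
    a word's vector holds its co-occurrence counts with every vocabulary
    word that appears in the same report."""
    vocab = sorted({w for r in reports for w in tokenize(r)})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for report in reports:
        words = tokenize(report)
        for w in words:
            for c in words:
                if c != w:
                    vectors[w][index[c]] += 1.0
    return vectors


def embed_report(text, vectors):
    """Average the vectors of the known words; zero vector if none are known."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in tokenize(text) if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_files(new_bug, history, vectors, k=2):
    """Take the k most similar historical bugs and score each of their
    fixed files by the summed similarity of the bugs that touched it."""
    query = embed_report(new_bug, vectors)
    scored = sorted(
        ((cosine(query, embed_report(text, vectors)), files)
         for text, files in history),
        key=lambda pair: pair[0],
        reverse=True,
    )
    file_scores = {}
    for sim, files in scored[:k]:
        if sim <= 0.0:
            continue  # ignore bugs with no measurable relatedness
        for f in files:
            file_scores[f] = file_scores.get(f, 0.0) + sim
    return sorted(file_scores.items(), key=lambda kv: kv[1], reverse=True)


vectors = train_embeddings([text for text, _ in HISTORY])
for path, score in rank_files("exception while saving profile", HISTORY, vectors):
    print(f"{score:.3f}  {path}")
```

In this toy setup the query shares no exact phrase with the second report, yet both profile-related reports contribute to the ranking because their words co-occur with the query's words. A learned embedding would be expected to capture such relatedness more robustly than raw counts.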