Natural Language Processing Based Detection of Duplicate Defect Patterns

Qian Wu, Qianxiang Wang
2010 IEEE 34th Annual Computer Software and Applications Conference Workshops, published 2010-07-19
DOI: 10.1109/COMPSACW.2010.45 (https://doi.org/10.1109/COMPSACW.2010.45)
Citations: 6

Abstract

A defect pattern repository collects defect patterns, which are general descriptions of the characteristics of commonly occurring software code defects. Defect patterns are used by programmers, by static defect analysis tools, and even in runtime verification. Following the idea of Web 2.0, defect pattern repositories allow these users to submit the defect patterns they find. However, submitting duplicate patterns introduces redundancy into the repository. This paper introduces an approach, based on natural language processing, that suggests potential duplicates. The approach first computes field similarities using the Vector Space Model, then employs information entropy to determine the importance of each field, and then combines the field similarities into the final defect pattern similarity. Two strategies are introduced to adapt the approach to special situations. Finally, groups of duplicates are obtained by hierarchical clustering. Evaluation indicates that the approach detects most of the actual duplicates in the repository (72% in our experiment).
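The pipeline named in the abstract (per-field Vector Space Model similarity, entropy-based field weights, a combined similarity, then hierarchical clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the term-weighting scheme, the entropy formula, the linkage criterion, or the two adaptive strategies, so the sketch assumes smoothed TF-IDF weights, field weights proportional to the Shannon entropy of each field's term distribution, and average linkage cut at a fixed threshold. The field names, the sample patterns, and the threshold of 0.3 are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (term -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    # Assumption: smoothed IDF, so a term found in every text keeps a nonzero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def field_entropy(texts):
    """Shannon entropy (in bits) of one field's term distribution."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pattern_similarities(patterns, fields):
    """Entropy-weighted sum of per-field cosine similarities for every pair."""
    vectors = {f: tfidf_vectors([p[f] for p in patterns]) for f in fields}
    entropy = {f: field_entropy([p[f] for p in patterns]) for f in fields}
    total = sum(entropy.values()) or 1.0
    # Assumption: a field with higher entropy is treated as more important.
    weights = {f: entropy[f] / total for f in fields}
    return {
        (i, j): sum(weights[f] * cosine(vectors[f][i], vectors[f][j]) for f in fields)
        for i, j in combinations(range(len(patterns)), 2)
    }


def cluster(n, sims, threshold):
    """Average-linkage agglomerative clustering, stopped at `threshold`."""
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        pairs = [(min(x, y), max(x, y)) for x in a for y in b]
        return sum(sims[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        score, a, b = max(
            (linkage(clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps index a valid
        del clusters[b]
    # Only clusters with two or more members are suggested as duplicate groups.
    return [sorted(c) for c in clusters if len(c) > 1]


# Hypothetical repository entries with two text fields each.
patterns = [
    {"name": "Null pointer dereference",
     "description": "A reference is used without checking whether it is null."},
    {"name": "Dereference of null pointer",
     "description": "The code uses a reference that may be null without a null check."},
    {"name": "Unclosed file stream",
     "description": "A stream is opened but never closed, leaking a file handle."},
]
sims = pattern_similarities(patterns, ["name", "description"])
groups = cluster(len(patterns), sims, threshold=0.3)
print(groups)  # prints [[0, 1]]
```

In this toy repository the two null-pointer entries are grouped and the file-stream entry is left alone. The cut-off threshold controls how aggressively groups are merged and would need tuning on real data; the sketch also keeps stop words, which inflates similarity between unrelated descriptions slightly.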