Xuezhi Song, Yun Lin, Siang Hwee Ng, Yijian Wu, Xin Peng, J. Dong, Hong Mei
{"title":"RegMiner: towards constructing a large regression dataset from code evolution history","authors":"Xuezhi Song, Yun Lin, Siang Hwee Ng, Yijian Wu, Xin Peng, J. Dong, Hong Mei","doi":"10.1145/3533767.3534224","DOIUrl":null,"url":null,"abstract":"Bug datasets lay significant empirical and experimental foundation for various SE/PL researches such as fault localization, software testing, and program repair. Current well-known datasets are constructed manually, which inevitably limits their scalability, representativeness, and the support for the emerging data-driven research. In this work, we propose an approach to automate the process of harvesting replicable regression bugs from the code evolution history. We focus on regression bugs, as they (1) manifest how a bug is introduced and fixed (as non-regression bugs), (2) support regression bug analysis, and (3) incorporate more specification (i.e., both the original passing version and the fixing version) than nonregression bug dataset for bug analysis. Technically, we address an information retrieval problem on code evolution history. Given a code repository, we search for regressions where a test can pass a regression-fixing commit, fail a regression-inducing commit, and pass a previous working commit. We address the challenges of (1) identifying potential regression-fixing commits from the code evolution history, (2) migrating the test and its code dependencies over the history, and (3) minimizing the compilation overhead during the regression search. We build our tool, RegMiner, which harvested 1035 regressions over 147 projects in 8 weeks, creating the largest replicable regression dataset within the shortest period, to the best of our knowledge. Our extensive experiments show that (1) RegMiner can construct the regression dataset with very high precision and acceptable recall, and (2) the constructed regression dataset is of high authenticity and diversity. We foresee that a continuously growing regression dataset opens many data-driven research opportunities in the SE/PL communities.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"62 23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3533767.3534224","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
Bug datasets lay a significant empirical and experimental foundation for various SE/PL research areas such as fault localization, software testing, and program repair. Current well-known datasets are constructed manually, which inevitably limits their scalability, representativeness, and support for emerging data-driven research. In this work, we propose an approach to automate the process of harvesting replicable regression bugs from the code evolution history. We focus on regression bugs because they (1) manifest how a bug is introduced and fixed (as opposed to non-regression bugs), (2) support regression bug analysis, and (3) incorporate more specification (i.e., both the original passing version and the fixing version) than non-regression bug datasets for bug analysis. Technically, we address an information retrieval problem over code evolution history. Given a code repository, we search for regressions where a test passes a regression-fixing commit, fails a regression-inducing commit, and passes a previous working commit. We address the challenges of (1) identifying potential regression-fixing commits from the code evolution history, (2) migrating the test and its code dependencies over the history, and (3) minimizing the compilation overhead during the regression search. We built our tool, RegMiner, which harvested 1035 regressions over 147 projects in 8 weeks, creating, to the best of our knowledge, the largest replicable regression dataset within the shortest period. Our extensive experiments show that (1) RegMiner can construct the regression dataset with very high precision and acceptable recall, and (2) the constructed regression dataset is of high authenticity and diversity. We foresee that a continuously growing regression dataset will open many data-driven research opportunities in the SE/PL communities.
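The core search criterion (a test that passes the regression-fixing commit, fails the regression-inducing commit, and passes an earlier working commit) can be illustrated with a minimal sketch. The Python snippet below is a hypothetical illustration, not the authors' implementation: the helpers `checkout` and `run_test`, the Maven invocation, and the log-message check are all assumptions, and the sketch deliberately ignores the test-migration and compilation-overhead challenges that RegMiner actually addresses.

```python
# Hypothetical sketch of the pass/fail/pass criterion for a replicable
# regression; not RegMiner's implementation.
import subprocess
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    ERROR = "error"  # e.g., compilation failure: not a usable data point


def checkout(repo_dir: str, commit: str) -> None:
    """Check out the given commit in the working tree (assumed helper)."""
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", "--force", commit], check=True
    )


def run_test(repo_dir: str, test_id: str) -> Outcome:
    """Build and run a single test with Maven (an assumption about the
    project's build system, not RegMiner's actual build handling)."""
    proc = subprocess.run(
        ["mvn", f"-Dtest={test_id}", "-DfailIfNoTests=false", "test"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if proc.returncode == 0:
        return Outcome.PASS
    # Maven typically reports "There are test failures." when tests fail;
    # any other nonzero exit (e.g., a compile error) is treated as unusable.
    if "There are test failures" in proc.stdout:
        return Outcome.FAIL
    return Outcome.ERROR


def is_regression(repo_dir: str, test_id: str,
                  fixing_commit: str, inducing_commit: str,
                  working_commit: str) -> bool:
    """True iff the test passes the fixing commit, fails the inducing
    commit, and passes the earlier working commit."""
    checkout(repo_dir, fixing_commit)
    if run_test(repo_dir, test_id) is not Outcome.PASS:
        return False
    checkout(repo_dir, inducing_commit)
    if run_test(repo_dir, test_id) is not Outcome.FAIL:
        return False
    checkout(repo_dir, working_commit)
    return run_test(repo_dir, test_id) is Outcome.PASS
```

Note that this sketch assumes the test exists and compiles at all three commits; in practice the regression-revealing test usually only exists at the fixing commit, which is why the paper must migrate the test and its code dependencies across the history and cache compilation work during the search.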