Empirical studies on software evolution: should we (try to) claim causation?

IWPSE-EVOL '10 Pub Date: 2010-09-20 DOI: 10.1145/1862372.1862375

M. Di Penta


Citations: 2

Abstract

Over the years, hundreds of studies have aimed at characterizing the evolution of software systems. Many of these studies analyze the behavior of a variable over a given period of observation. How does the size of a software system evolve? What about its complexity? Does the number of defects increase over time, or does it remain stable?

In some cases, studies also attempt to correlate variables and, possibly, to build predictors upon them. That is, one could estimate the likelihood that a fault occurs in a class based on the metrics the class exhibits and on the kinds of changes it underwent. Similarly, change couplings can be inferred by observing how artifacts tend to co-change. Although in many cases we are able to obtain models with good prediction performance, we are not able to claim any cause-effect relationship between our independent and dependent variables. We could easily correlate the presence of some design constructs with the change-proneness of a software component; however, the same correlation could be found with the amount of good Belgian beer our developers drink. As a matter of fact, the component could undergo changes for other, external reasons.

Recent software evolution studies rely on fine-grained information mined by integrating several kinds of repositories, such as versioning systems, bug tracking systems, and mailing lists. Nowadays, many other precious sources of information, ranging from code search repositories and vulnerability databases to informal communications and legal documents, are also being considered. This could help capture the rationale behind some events occurring in a software project and link them to the statistical relations we observe.

The road from solid empirical models to "principles of software evolution" will likely be long and difficult; therefore, we should prepare ourselves to traverse it and go as far as possible with limited damage. To do this, we need to carefully prepare our traveling equipment by paying attention to: (i) combining quantitative studies with qualitative studies, surveys, and informal interviews; (ii) relating social relations among developers to variables observed on the project; (iii) using proper statistical and machine learning techniques able to capture the temporal relation among different events; and (iv) making massive use of natural language processing and text mining across the various sources of information available.
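
As an illustration of the abstract's remark that change couplings can be inferred from how artifacts co-change, the following is a minimal sketch and not the author's method. It assumes each commit is available as a plain list of changed file names; the function name `co_change_couplings`, the `min_support` threshold, and the sample file names are all hypothetical. The resulting numbers are purely correlational: they show that two files tend to change together, not why.

```python
from collections import Counter
from itertools import combinations


def co_change_couplings(commits, min_support=2):
    """Count how often pairs of files change in the same commit.

    Returns {(a, b): (support, conf_a, conf_b)} where support is the number
    of commits touching both files, conf_a = support / commits touching a,
    and conf_b = support / commits touching b.
    """
    file_changes = Counter()
    pair_changes = Counter()
    for files in commits:
        unique = sorted(set(files))  # ignore duplicates, fix pair order
        file_changes.update(unique)
        pair_changes.update(combinations(unique, 2))

    couplings = {}
    for (a, b), support in pair_changes.items():
        if support >= min_support:
            couplings[(a, b)] = (
                support,
                support / file_changes[a],
                support / file_changes[b],
            )
    return couplings


# Hypothetical change history: one list of changed files per commit.
commits = [
    ["Parser.java", "Lexer.java"],
    ["Parser.java", "Lexer.java", "README"],
    ["Parser.java"],
    ["Lexer.java", "Parser.java"],
    ["README"],
]

couplings = co_change_couplings(commits)
for pair, (support, conf_a, conf_b) in couplings.items():
    print(pair, support, conf_a, conf_b)
```

A high confidence value here only says that the two files were frequently committed together. Whether one change caused the other, or both were driven by an external reason such as a requirement change, is exactly the question the abstract argues these numbers cannot answer on their own.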
