An Empirical Evaluation of Distribution-based Thresholds for Internal Software Measures

L. Lavazza, S. Morasca
{"title":"An Empirical Evaluation of Distribution-based Thresholds for Internal Software Measures","authors":"L. Lavazza, S. Morasca","doi":"10.1145/2972958.2972965","DOIUrl":null,"url":null,"abstract":"Background Setting thresholds is important for the practical use of internal software measures, so software modules can be classified as having either acceptable or unacceptable quality, and software practitioners can take appropriate quality improvement actions. Quite a few methods have been proposed for setting thresholds and several of them are based on the distribution of an internal measure's values (and, possibly, other internal measures), without any explicit relationship with any external software quality of interest. Objective In this paper, we empirically investigate the consequences of defining thresholds on internal measures without taking into account the external measures that quantify qualities of practical interest. We focus on fault-proneness as the specific quality of practical interest. Method We analyzed datasets from the PROMISE repository. First, we computed the thresholds of code measures according to three distribution-based methods. Then, we derived statistically significant models of fault-proneness that use internal measures as independent variables. We then evaluated the indications provided by the distribution-based thresholds when used along with the fault-proneness models. Results Some methods for defining distribution-based thresholds requires that code measures be normally distributed. However, we found that this is hardly ever the case with the PROMISE datasets, making that entire class of methods inapplicable. We adapted these methods for non-normal distributions and obtained thresholds that appear reasonable, but are characterized by a large variation in the fault-proneness risk level they entail. 
Given a dataset, the thresholds for different internal measures---when used as independent variables of statistically significant models---provide fairly different values of fault-proneness. This is quite dangerous for practitioners, since they get thresholds that are presented as equally important, but practically can correspond to very different levels of user-perceivable quality. For other distribution-based methods, we found that the proposed thresholds are practically useless, as many modules with values of internal measures deemed acceptable according to the thresholds actually have high fault-proneness. Also, the accuracy of all of these methods appears to be lower than the accuracy obtained by simply estimating modules at random. Conclusions Our results indicate that distribution-based thresholds appear to be unreliable in providing sensible indications about the quality of software modules. Practitioners should instead use different kinds of threshold-setting methods, such as the ones that take into account data about the presence of faults in software modules, in addition to the values of internal software measures.","PeriodicalId":176848,"journal":{"name":"Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2972958.2972965","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background. Setting thresholds is important for the practical use of internal software measures: thresholds allow software modules to be classified as having either acceptable or unacceptable quality, so that software practitioners can take appropriate quality-improvement actions. Quite a few threshold-setting methods have been proposed, and several of them are based on the distribution of an internal measure's values (and, possibly, of other internal measures), without any explicit relationship to an external software quality of interest.

Objective. In this paper, we empirically investigate the consequences of defining thresholds on internal measures without taking into account the external measures that quantify qualities of practical interest. We focus on fault-proneness as the specific quality of practical interest.

Method. We analyzed datasets from the PROMISE repository. First, we computed thresholds for code measures according to three distribution-based methods. Then, we derived statistically significant fault-proneness models that use internal measures as independent variables. Finally, we evaluated the indications provided by the distribution-based thresholds when used together with the fault-proneness models.

Results. Some methods for defining distribution-based thresholds require that code measures be normally distributed. However, we found that this is hardly ever the case with the PROMISE datasets, making that entire class of methods inapplicable. We adapted these methods to non-normal distributions and obtained thresholds that appear reasonable, but entail widely varying fault-proneness risk levels. Given a dataset, the thresholds for different internal measures, when those measures are used as independent variables of statistically significant models, correspond to fairly different values of fault-proneness. This is quite dangerous for practitioners, since they are given thresholds that are presented as equally important but can in practice correspond to very different levels of user-perceivable quality. For other distribution-based methods, we found that the proposed thresholds are practically useless, as many modules whose internal-measure values are deemed acceptable according to the thresholds actually have high fault-proneness. In addition, the accuracy of all of these methods appears to be lower than that obtained by simply classifying modules at random.

Conclusions. Our results indicate that distribution-based thresholds are unreliable in providing sensible indications about the quality of software modules. Practitioners should instead use other kinds of threshold-setting methods, such as those that take into account data about the presence of faults in software modules, in addition to the values of internal software measures.
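To illustrate the kind of evaluation described above, the sketch below applies one common distribution-based rule (mean plus one standard deviation, which implicitly assumes normality) to a synthetic, right-skewed code measure, and then checks how many modules the threshold deems acceptable are nevertheless faulty. This is a hypothetical illustration, not the paper's exact procedure: the dataset, the specific rule, and the fault model are all invented here.

```python
# Illustrative sketch only: a mean + 1*stdev threshold (a simple
# distribution-based rule) applied to synthetic module data.
import random
import statistics

random.seed(42)

# Synthetic dataset: (cyclomatic complexity, is_faulty) per module.
# Code measures are typically skewed, so draw from a lognormal, and
# make faults more likely in more complex modules.
modules = []
for _ in range(500):
    cc = random.lognormvariate(1.5, 0.8)
    p_fault = min(0.9, 0.05 + 0.02 * cc)
    modules.append((cc, random.random() < p_fault))

values = [cc for cc, _ in modules]
threshold = statistics.mean(values) + statistics.stdev(values)

# How many modules deemed "acceptable" by the threshold are faulty?
acceptable = [(cc, f) for cc, f in modules if cc <= threshold]
faulty_acceptable = sum(f for _, f in acceptable)
print(f"threshold = {threshold:.1f}")
print(f"acceptable modules: {len(acceptable)}, "
      f"of which faulty: {faulty_acceptable}")
```

Even in this toy setting, the threshold says nothing by itself about the fault-proneness of the modules it labels acceptable, which is the gap the paper examines empirically.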