Recommendations for analysing and meta-analysing small sample size software engineering experiments
Barbara Kitchenham, Lech Madeyski
Empirical Software Engineering, 2024-08-17. DOI: 10.1007/s10664-024-10504-1
Citations: 0
Abstract
Context
Software engineering (SE) experiments often have small sample sizes. This can result in data sets with non-normal characteristics, which poses problems as standard parametric meta-analysis, using the standardized mean difference (StdMD) effect size, assumes normally distributed sample data. Small sample sizes and non-normal data set characteristics can also lead to unreliable estimates of parametric effect sizes. Meta-analysis is even more complicated if experiments use complex experimental designs, such as two-group and four-group cross-over designs, which are popular in SE experiments.
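To make the parametric baseline concrete, below is a minimal sketch of the standardized mean difference in its Cohen's d form (mean difference over the pooled standard deviation). This is an illustration only: the paper's actual StdMD estimator may apply small-sample corrections (e.g. Hedges' adjustment), and the function name is ours.

```python
import math

def std_md(x, y):
    """Standardized mean difference (Cohen's d form) between two
    independent groups: (mean(x) - mean(y)) / pooled SD.
    Illustrative sketch; the paper may use a corrected variant."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Unbiased sample variances
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    # Pooled standard deviation
    sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / sp

# Example: two small samples with pooled SD 2 and mean difference 1
print(std_md([2, 4, 6], [1, 3, 5]))  # -> 0.5
```

Because the pooled SD appears in the denominator, heavy-tailed or skewed samples can make this estimate unstable at small n, which is the motivation for the robust alternatives discussed next.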
Objective
Our objective was to develop a validated and robust meta-analysis method that can help to address the problems of small sample sizes and complex experimental designs without relying upon data samples being normally distributed.
Method
To illustrate the challenges, we used real SE data sets. We built upon previous research and developed a robust meta-analysis method able to deal with challenges typical for SE experiments. We validated our method via simulations comparing StdMD with two robust alternatives: the probability of superiority (\(\hat{p}\)) and Cliff’s d.
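The two robust effect sizes compared against StdMD have simple pairwise definitions: \(\hat{p} = P(X > Y) + 0.5\,P(X = Y)\) estimated over all treatment/control pairs, and Cliff's \(d = P(X > Y) - P(X < Y)\), which equals \(2\hat{p} - 1\). A minimal sketch (function names are ours; the paper's implementation may differ):

```python
def prob_superiority(x, y):
    """Probability of superiority p_hat = P(X > Y) + 0.5 * P(X = Y),
    estimated over all (x_i, y_j) pairs. Ties count as half."""
    gt = sum(1 for xi in x for yi in y if xi > yi)
    eq = sum(1 for xi in x for yi in y if xi == yi)
    return (gt + 0.5 * eq) / (len(x) * len(y))

def cliffs_d(x, y):
    """Cliff's d = P(X > Y) - P(X < Y), equivalently 2 * p_hat - 1."""
    return 2 * prob_superiority(x, y) - 1

# Example: 9 pairs, 8 of which favour the first group, 1 tie
x, y = [3, 4, 5], [1, 2, 3]
print(prob_superiority(x, y))  # -> 0.944... (8.5 / 9)
print(cliffs_d(x, y))          # -> 0.888... (8 / 9)
```

Both estimators depend only on the ordering of observations, not on their magnitudes, which is why they remain well behaved for skewed or heavy-tailed samples where StdMD does not.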
Results
We confirmed that many SE data sets are small and that small experiments run the risk of exhibiting non-normal properties, which can cause problems for analysing families of experiments. For simulations of individual experiments and meta-analyses of families of experiments, \(\hat{p}\) and Cliff’s d consistently outperformed StdMD in terms of negligible small sample bias. They also had better power for log-normal and Laplace samples, although lower power for normal and gamma samples. Tests based on \(\hat{p}\) always had equal or better power than tests based on Cliff’s d, and across all but one simulation condition, \(\hat{p}\) Type 1 error rates were less biased.
Conclusions
Using \(\hat{p}\) is a low-risk option for analysing and meta-analysing data from small sample-size SE randomized experiments. Parametric methods are only preferable if you have prior knowledge of the data distribution.
About the journal:
Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories.
The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings.
Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.