Recommendations for analysing and meta-analysing small sample size software engineering experiments

IF 3.5 · Zone 2 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)
Barbara Kitchenham, Lech Madeyski
Empirical Software Engineering. Published 2024-08-17. DOI: 10.1007/s10664-024-10504-1 (https://doi.org/10.1007/s10664-024-10504-1)

Abstract

Context

Software engineering (SE) experiments often have small sample sizes. This can result in data sets with non-normal characteristics, which poses problems because standard parametric meta-analysis, which uses the standardized mean difference (StdMD) effect size, assumes normally distributed sample data. Small sample sizes and non-normal data set characteristics can also lead to unreliable estimates of parametric effect sizes. Meta-analysis is even more complicated if experiments use complex experimental designs, such as two-group and four-group cross-over designs, which are popular in SE experiments.
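
For readers unfamiliar with StdMD, the following is a minimal sketch of the usual textbook form for two independent groups: the difference in means divided by the pooled standard deviation. The function name and the sample values are illustrative assumptions, and the sketch ignores the cross-over designs mentioned above; it is not taken from the paper.

```python
import math
from statistics import mean, variance


def std_mean_difference(treatment, control):
    """Standardized mean difference for two independent groups.

    Assumption: the common pooled-standard-deviation form is used,
    with no small-sample correction applied.
    """
    n1, n2 = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical scores from a tiny two-group experiment.
print(std_mean_difference([2, 4, 6], [1, 3, 5]))  # 0.5
```

Because both the means and the pooled variance enter the formula directly, a single extreme observation in a small sample can shift the estimate noticeably, which is the sensitivity the paragraph above describes.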

Objective

Our objective was to develop a validated and robust meta-analysis method that can help to address the problems of small sample sizes and complex experimental designs without relying upon data samples being normally distributed.

Method

To illustrate the challenges, we used real SE data sets. We built upon previous research and developed a robust meta-analysis method able to deal with challenges typical for SE experiments. We validated our method via simulations comparing StdMD with two robust alternatives: the probability of superiority (\(\hat{p}\)) and Cliff’s d.
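
The two robust effect sizes can be sketched from their standard pairwise definitions for two independent groups. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, ties are counted as half a win in \(\hat{p}\) (a common convention), and the cross-over designs the paper addresses are not handled.

```python
def probability_of_superiority(treatment, control):
    """Estimate the probability that a random treatment value exceeds
    a random control value. Assumption: ties count as half a win."""
    wins = sum(1 for x in treatment for y in control if x > y)
    ties = sum(1 for x in treatment for y in control if x == y)
    return (wins + 0.5 * ties) / (len(treatment) * len(control))


def cliffs_d(treatment, control):
    """Cliff's d: proportion of pairs where treatment wins minus
    the proportion of pairs where control wins."""
    wins = sum(1 for x in treatment for y in control if x > y)
    losses = sum(1 for x in treatment for y in control if x < y)
    return (wins - losses) / (len(treatment) * len(control))


# Hypothetical scores from a tiny two-group experiment.
treatment, control = [2, 4, 6], [1, 3, 5]
p_hat = probability_of_superiority(treatment, control)
d = cliffs_d(treatment, control)
print(p_hat, d)
```

Under these definitions the two statistics are linearly related, d = 2\(\hat{p}\) - 1, and both depend only on the ordering of the observations, not on their magnitudes, which is why extreme values affect them less than they affect StdMD.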

Results

We confirmed that many SE data sets are small and that small experiments run the risk of exhibiting non-normal properties, which can cause problems for analysing families of experiments. In simulations of both individual experiments and meta-analyses of families of experiments, \(\hat{p}\) and Cliff’s d consistently outperformed StdMD in that their small sample bias was negligible. They also had better power for log-normal and Laplace samples, although lower power for normal and gamma samples. Tests based on \(\hat{p}\) always had power equal to or better than tests based on Cliff’s d, and in all but one simulation condition, \(\hat{p}\) Type 1 error rates were less biased.

Conclusions

Using \(\hat{p}\) is a low-risk option for analysing and meta-analysing data from small sample-size SE randomized experiments. Parametric methods are only preferable if you have prior knowledge of the data distribution.


Source journal: Empirical Software Engineering (Engineering and Technology - Computer Science: Software Engineering)
CiteScore: 8.50
Self-citation rate: 12.20%
Articles published: 169
Review time: >12 weeks
Journal description: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies (processes, methods, or tools) and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.