Section Editor's Note: Using Power Insights to Better Plan Experiments

IF 1.1 | Zone 3, Sociology | Q2, SOCIAL SCIENCES, INTERDISCIPLINARY

American Journal of Evaluation Pub Date : 2023-03-01 DOI:10.1177/10982140231154695

Laura R. Peck

Volume 44(1), pages 114-117 | https://doi.org/10.1177/10982140231154695

Citations: 0

Abstract


Section Editor's Note: Using Power Insights to Better Plan Experiments

How many people need to be in my evaluation in order to detect a policy- or program-relevant impact? If the program being evaluated is assigned to participants at an aggregate “cluster” or group level (such as classrooms filled with students), how many of those groups do I need? How many participants within each group? What if I am interested in subgroup effects; how many people or groups do I need then? Answers to these questions are essential for smart planning of experimental evaluations and are the motivation for this Experimental Methodology Section.

Before I summarize the contributions of this Section’s three articles, let me first define some key concepts and explain what I see to be the main issues for this piece of experimental evaluation work. To begin, statistical “power” refers to an evaluation’s ability to detect an effect that is statistically significant, and a minimum detectable effect (MDE) is the smallest estimated effect that a given design can detect as statistically significant. Ultimately, the effect size is what a given evaluation is designed to estimate, and the evaluator will have to determine (1) what sample design and size are needed to detect that effect, or (2) what MDE is feasible, given budget and sample design and size realities.

Several interrelated factors influence a study’s MDE, including (as drawn partly from Peck, 2020, Appendix Box A.1) the choices and realities of statistical significance threshold, statistical power, variance of the impact estimate, the level and variability of the outcome measure, and the clustered nature of the data, as elaborated next.

Statistical significance threshold. The statistical significance level is the probability of identifying a false positive result (also referred to as Type I error). The MDE becomes larger as the statistical significance level decreases, that is, as the threshold becomes more stringent. All else equal, an impact must be larger to be detected with a statistical significance threshold of 1% than with a statistical significance threshold of 10%. Substantial debate in statistics and related fields focuses on “the p-value” and its value to establishing evidence (e.g., Wasserstein & Lazar, 2016).

Statistical power. The statistical power is equal to the probability of correctly rejecting the null hypothesis (or, one minus the probability of a false negative result, or Type II error). In other words, power relates to the analyst’s ability to detect an impact that is statistically significant, should it exist. Statistical power is typically set to 80%, although other values may be reasonable too. Missing the detection of a favorable impact (Type II error) has lower up-front cost implications for the study, relative to falsely claiming that a favorable impact exists (Type I error). That said, an insufficiently powered study might lead to not generating new information (or, worse, to incorrect null findings), an ill-funded investment.
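The link between the significance threshold, power, and the MDE described above can be made concrete with a small numerical sketch. The code below is not taken from the article. It assumes the common normal-approximation rule in which the MDE equals the standard error of the impact estimate times the sum of two standard-normal quantiles, one for the significance threshold and one for power. It also assumes individual-level random assignment with a common outcome standard deviation in both arms. The function names (`mde_multiplier`, `two_arm_standard_error`, `mde`) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def mde_multiplier(alpha: float = 0.05, power: float = 0.80,
                   two_sided: bool = True) -> float:
    """Multiple of the impact estimate's standard error that equals the MDE.

    Assumption: normal approximation (large samples), not exact t quantiles.
    """
    z = NormalDist().inv_cdf          # standard-normal quantile function
    tail = alpha / 2 if two_sided else alpha
    return z(1 - tail) + z(power)


def two_arm_standard_error(sd: float, n_total: int,
                           treated_share: float = 0.5) -> float:
    """Standard error of a difference in means under individual-level random
    assignment, assuming the same outcome SD in both arms and no covariates."""
    return sd * (1.0 / (n_total * treated_share * (1.0 - treated_share))) ** 0.5


def mde(standard_error: float, alpha: float = 0.05, power: float = 0.80,
        two_sided: bool = True) -> float:
    """Minimum detectable effect, in the same units as the standard error."""
    return mde_multiplier(alpha, power, two_sided) * standard_error


if __name__ == "__main__":
    # Outcome measured in SD units (sd=1.0), 400 participants split evenly.
    se = two_arm_standard_error(sd=1.0, n_total=400)
    for alpha in (0.10, 0.05, 0.01):
        print(f"alpha={alpha:.2f}  multiplier={mde_multiplier(alpha):.2f}  "
              f"MDE={mde(se, alpha):.3f}")
```

Under these assumptions, the multiplier at 80% power is roughly 2.49 at a 10% threshold, 2.80 at 5%, and 3.42 at 1%, which illustrates why a more stringent threshold implies a larger MDE for the same sample. Clustered designs would need a larger standard error than the one computed here; that adjustment is deliberately left out of this sketch.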


Source journal

American Journal of Evaluation (SOCIAL SCIENCES, INTERDISCIPLINARY)

CiteScore: 4.40

Self-citation rate: 11.80%

Journal description: The American Journal of Evaluation (AJE) publishes original papers about the methods, theory, practice, and findings of evaluation. The general goal of AJE is to present the best work in and about evaluation, in order to improve the knowledge base and practice of its readers. Because the field of evaluation is diverse, with different intellectual traditions, approaches to practice, and domains of application, the papers published in AJE will reflect this diversity. Nevertheless, preference is given to papers that are likely to be of interest to a wide range of evaluators and that are written to be accessible to most readers.