What Would Be the Effect of Lowering the Threshold of Statistical Significance From 0.05 to 0.005 in Foot and Ankle Randomized Controlled Trials?

IF 4.4 · Zone 2 (Medicine) · Q1 ORTHOPEDICS

Clinical Orthopaedics and Related Research® · Pub Date: 2025-09-17 · DOI: 10.1097/corr.0000000000003689

Yoshiharu Shimozono, Yuki Shinya, Shuichi Matsuda


Abstract

BACKGROUND
The threshold for statistical significance (p < 0.05) has been debated in recent years, with proposals to lower it to p < 0.005. The aim of these proposals is to reduce the frequency of papers that conclude with false-positive results, which can lead to overtreatment of patients and worsen the problem of nonreplicable results in medical research. However, to our knowledge, the impact of adopting that suggestion (how many studies might be reclassified as no-difference studies, and how much larger studies would need to become) has not been assessed in orthopaedic surgery.

QUESTIONS/PURPOSES
We used randomized trials in foot and ankle research to answer the question: If the threshold for statistical significance were lowered from p < 0.05 to p < 0.005, (1) what proportion of foot and ankle RCTs would be reclassified as no-difference trials under the stricter p value threshold, and (2) how much larger would studies have needed to be to retain or obtain 80% power at the p < 0.005 level?

METHODS
We manually reviewed all articles published between 2019 and 2024 in the top 10 ranked orthopaedic journals and the top three foot and ankle-specific journals, both selected based on their 2023 two-year journal Impact Factor, focusing on foot and ankle studies. Studies were included if they met the following criteria: (1) RCT design, (2) focus on foot and ankle conditions or interventions, (3) published in English, and (4) reported p values for primary outcomes. After screening, a total of 123 RCTs met these criteria and were included in the final analysis.

The p values for the primary endpoints of those studies were extracted and analyzed under both thresholds. If a study had multiple primary endpoints or evaluated the primary endpoint across multiple domains, all p values were included. We categorized p values into three groups based on the classification proposed by Ioannidis: (1) p < 0.005 as "statistically significant," (2) 0.005 ≤ p < 0.05 as "suggestive," and (3) p ≥ 0.05 as "nonsignificant."

For studies with sufficient power analysis data, we calculated the sample size increase needed to maintain 80% statistical power (1 - beta) at an alpha level of 0.005, using the variance reported in the source studies. The effect size (delta) was inferred from the between-group differences reported in each study. Additionally, multivariable logistic regression analysis was performed to identify factors associated with maintaining statistical significance under the p < 0.005 threshold.

RESULTS
Among 281 primary endpoints identified from 123 trials, 44% (124 of 281) were statistically significant using the threshold defined in those articles (p < 0.05). Of these significant endpoints, only 42% (52 of 124) met the proposed threshold (p < 0.005), whereas 58% (72 of 124) fell between 0.005 and 0.05. Following the classification proposed by Benjamin et al., these 72 endpoints would be reclassified as "suggestive" rather than statistically significant. Overall, only 19% (52 of 281) of all endpoints remained statistically significant under the proposed threshold of 0.005. Twenty-five percent (31 of 123) of trials maintained statistically significant primary endpoints.

Among the 123 trials, 54% (66 of 123) had sufficient power analysis data. Assuming an alpha of 0.005, power of 80%, and effect sizes derived from reported between-group differences, maintaining statistical power under the new threshold would require a mean increase of 69% in the sample size. Logistic regression analysis revealed that extracorporeal shock wave therapy (OR 6.8; p < 0.001) and injection therapy (OR 3.3; p = 0.008) were associated with maintaining significance under the stricter threshold.

CONCLUSION
Adopting a threshold of p < 0.005 would substantially affect the interpretation of published foot and ankle RCTs; using that threshold, more than one-half of published RCTs in foot and ankle surgery would have been reclassified as having only "suggestive" or no-difference findings on one or more primary study endpoints.

CLINICAL RELEVANCE
Lowering the p value threshold to 0.005 would require larger sample sizes, posing feasibility challenges in foot and ankle surgery because of smaller patient populations. While this shift aims to reduce false-positives, it risks excluding meaningful findings from underpowered studies. More importantly, this debate highlights that no single p value threshold is universally appropriate. Instead of rigidly applying 0.05 or 0.005, researchers should adjust thresholds based on study context, allowing more relaxed thresholds for exploratory studies and stricter ones for high-risk interventions.
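
The three-way p value classification and the sample size question described in the abstract can be illustrated with a short Python sketch. This is not the authors' analysis: the function names are hypothetical, and the sample size part assumes a simple two-sided, two-group comparison under a normal approximation in which the effect size and variance stay fixed, so the required sample size scales with (z_{1-alpha/2} + z_{1-beta})^2. The study itself used the variance and between-group differences reported in each trial.

```python
from statistics import NormalDist


def classify_p(p: float) -> str:
    """Three-group classification as stated in the abstract's Methods."""
    if p < 0.005:
        return "statistically significant"
    if p < 0.05:
        return "suggestive"
    return "nonsignificant"


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)


def inflation_factor(alpha_old: float = 0.05,
                     alpha_new: float = 0.005,
                     power: float = 0.80) -> float:
    """Ratio of required sample sizes (new alpha / old alpha).

    Assumption (not from the source): two-sided test, normal
    approximation, identical effect size and variance at both alphas.
    """
    z = NormalDist().inv_cdf           # standard normal quantile function
    z_beta = z(power)                  # quantile for the target power
    n_old = (z(1 - alpha_old / 2) + z_beta) ** 2
    n_new = (z(1 - alpha_new / 2) + z_beta) ** 2
    return n_new / n_old


if __name__ == "__main__":
    # Endpoint counts reported in the abstract's Results.
    print(percent(124, 281))  # endpoints with p < 0.05
    print(percent(52, 124))   # of those, endpoints with p < 0.005
    print(percent(72, 124))   # of those, "suggestive" endpoints
    print(percent(52, 281))   # all endpoints with p < 0.005

    for p in (0.001, 0.02, 0.30):      # illustrative p values only
        print(p, classify_p(p))

    print(f"sample size ratio: {inflation_factor():.2f}")
```

Under these simplifying assumptions the ratio comes out at roughly 1.7, that is, an increase of about 70%, which is of the same order as the mean 69% increase reported above. The agreement should be read only as a plausibility check, since the per-trial calculations in the study depend on each trial's own design and reported data.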


Source journal

Clinical Orthopaedics and Related Research® (Medicine: Surgery)

CiteScore: 7.00

Self-citation rate: 11.90%

Articles published: 722

Review time: 2.5 months

About the journal: Clinical Orthopaedics and Related Research® is a leading peer-reviewed journal devoted to the dissemination of new and important orthopaedic knowledge. CORR® brings readers the latest clinical and basic research, along with columns, commentaries, and interviews with authors.