What Would Be the Effect of Lowering the Threshold of Statistical Significance From 0.05 to 0.005 in Foot and Ankle Randomized Controlled Trials?
Authors: Yoshiharu Shimozono, Yuki Shinya, Shuichi Matsuda
Journal: Clinical Orthopaedics and Related Research® (Impact Factor 4.4; JCR Q1, Orthopedics)
Publication date: 2025-09-17
Publication type: Journal Article
DOI: 10.1097/corr.0000000000003689 (https://doi.org/10.1097/corr.0000000000003689)
Abstract
BACKGROUND
The threshold for statistical significance (p < 0.05) has been debated in recent years, with proposals to lower it to p < 0.005 to reduce the frequency of papers reporting false-positive results, which can lead to overtreatment of patients and worsen the problem of nonreplicable results in medical research. However, to our knowledge, the impact of that proposal (in terms of how many studies might be reclassified as no-difference studies and how much larger studies would need to become to implement it) has not been assessed in orthopaedic surgery.
QUESTIONS/PURPOSES
We used randomized trials in foot and ankle research to answer the question: If the threshold for statistical significance were lowered from p < 0.05 to p < 0.005, (1) what proportion of foot and ankle RCTs would be reclassified as no-difference trials under a stricter p value threshold, and (2) how much larger would studies have needed to be to retain or obtain 80% power at the p < 0.005 level?
METHODS
We manually reviewed all articles published between 2019 and 2024 in the top 10 ranked orthopaedic journals and the top three foot and ankle-specific journals, both selected based on their 2023 two-year journal Impact Factor, focusing on foot and ankle studies. Studies were included if they met the following criteria: (1) RCT design, (2) focus on foot and ankle conditions or interventions, (3) published in English, and (4) reported p values for primary outcomes. After screening, a total of 123 RCTs met these criteria and were included in the final analysis. Those studies' p values for primary endpoints were extracted and analyzed under both thresholds. If a study had multiple primary endpoints or evaluated the primary endpoint from multiple domains, all p values were included. We categorized p values into three groups based on the classification proposed by Ioannidis: (1) p < 0.005 as "statistically significant," (2) 0.005 ≤ p < 0.05 as "suggestive," and (3) p ≥ 0.05 as "nonsignificant." For studies with sufficient power analysis data, we calculated the required sample size increase needed to maintain 80% statistical power (1 - beta) at an alpha level of 0.005, using the variance reported in the source studies. The effect size (delta) was inferred from the between-group differences reported in each study. Additionally, multivariable logistic regression analysis was performed to identify factors associated with maintaining statistical significance under the p < 0.005 threshold.
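The three-group classification described above can be sketched in a few lines of Python. This is only an illustration of the stated cutoffs, not the authors' code; the function name `classify_p_value` and the example p values are hypothetical.

```python
def classify_p_value(p: float) -> str:
    """Label a p value using the three cutoffs stated in the Methods.

    p < 0.005          -> "statistically significant"
    0.005 <= p < 0.05  -> "suggestive"
    p >= 0.05          -> "nonsignificant"
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"p value must lie in [0, 1], got {p}")
    if p < 0.005:
        return "statistically significant"
    if p < 0.05:
        return "suggestive"
    return "nonsignificant"


# Hypothetical p values for illustration only; these are not trial data.
example_p_values = [0.001, 0.005, 0.03, 0.05, 0.2]
labels = [classify_p_value(p) for p in example_p_values]

for p, label in zip(example_p_values, labels):
    print(f"p = {p}: {label}")
```

Note that the boundaries are half-open: a p value of exactly 0.005 falls in the "suggestive" group, and a p value of exactly 0.05 is "nonsignificant".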
RESULTS
Among 281 primary endpoints identified from 123 trials, 44% (124 of 281) were statistically significant using the threshold defined in those articles (p < 0.05). Of these significant endpoints, only 42% (52 of 124) met the proposed threshold (p < 0.005), whereas 58% (72 of 124) fell between 0.005 and 0.05. Following the classification proposed by Benjamin et al., the latter endpoints would be reclassified as "suggestive" rather than statistically significant. Overall, only 19% (52 of 281) of all endpoints remained statistically significant under the proposed threshold of 0.005. Twenty-five percent (31 of 123) of trials maintained statistically significant primary endpoints. Among the 123 trials, 54% (66 of 123) had sufficient power analysis data. Assuming an alpha of 0.005, power of 80%, and effect sizes derived from reported between-group differences, maintaining statistical power under the new threshold would require a mean increase of 69% in the sample size. Logistic regression analysis revealed that extracorporeal shock wave therapy (OR 6.8; p < 0.001) and injection therapy (OR 3.3; p = 0.008) were associated with maintaining significance under the stricter threshold.
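A rough check of the sample size result is possible with a standard textbook approximation. The sketch below assumes a two-sided comparison of two means with equal group sizes under a normal approximation, where the per-group sample size is n = 2(sigma/delta)^2 (z_{1-alpha/2} + z_{1-beta})^2. The abstract does not state the authors' exact formula, so this is an assumption, and the `delta` and `sigma` values are hypothetical.

```python
import math
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def n_per_group(delta: float, sigma: float, alpha: float, power: float = 0.80) -> int:
    """Per-group n for a two-sided, two-sample comparison of means
    (normal approximation; assumed formula, not taken from the paper)."""
    z_alpha = _z(1 - alpha / 2)
    z_beta = _z(power)
    return math.ceil(2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2)


def inflation_ratio(alpha_old: float = 0.05, alpha_new: float = 0.005,
                    power: float = 0.80) -> float:
    """Ratio of required sample sizes when only alpha changes.
    delta and sigma cancel, so the ratio depends on alpha and power alone."""
    z_beta = _z(power)
    old = (_z(1 - alpha_old / 2) + z_beta) ** 2
    new = (_z(1 - alpha_new / 2) + z_beta) ** 2
    return new / old


ratio = inflation_ratio()
print(f"Required sample size grows by about {100 * (ratio - 1):.0f}%")

# Hypothetical effect size and standard deviation, for illustration only.
n_005 = n_per_group(delta=5.0, sigma=10.0, alpha=0.05)
n_0005 = n_per_group(delta=5.0, sigma=10.0, alpha=0.005)
print(f"Per-group n: {n_005} at alpha = 0.05, {n_0005} at alpha = 0.005")
```

Under these assumptions the ratio is about 1.70, in line with the reported mean increase of 69%. The authors computed the increase study by study from each trial's reported variance and between-group difference, so individual trials may deviate from this figure.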
CONCLUSION
Adopting a threshold of p < 0.005 would substantially impact the interpretation of published foot and ankle RCTs; using that threshold, more than one-half of published RCTs in foot and ankle surgery would have been reclassified as having only "suggestive" or no-difference findings on one or more primary study endpoints.
CLINICAL RELEVANCE
Lowering the p value threshold to 0.005 would require larger sample sizes, posing feasibility challenges in foot and ankle surgery because of smaller patient populations. While this shift aims to reduce false positives, it risks excluding meaningful findings from underpowered studies. More importantly, this debate highlights that no single p value threshold is universally appropriate. Instead of rigidly applying 0.05 or 0.005, researchers should adjust thresholds based on study context, allowing more relaxed thresholds for exploratory studies and stricter ones for high-risk interventions.
About the journal:
Clinical Orthopaedics and Related Research® is a leading peer-reviewed journal devoted to the dissemination of new and important orthopaedic knowledge.
CORR® brings readers the latest clinical and basic research, along with columns, commentaries, and interviews with authors.