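Larger sample sizes are needed when developing a clinical prediction model using machine learning in oncology: methodological systematic review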

IF 7.3 | Zone 2, Medicine | JCR Q1: Health Care Sciences & Services
Biruk Tsegaye, Kym I.E. Snell, Lucinda Archer, Shona Kirtley, Richard D. Riley, Matthew Sperrin, Ben Van Calster, Gary S. Collins, Paula Dhiman
Journal of Clinical Epidemiology, volume 180, Article 111675. Published 2025-01-13. DOI: 10.1016/j.jclinepi.2025.111675. Full text: https://www.sciencedirect.com/science/article/pii/S0895435625000083

Abstract


Background and Objectives

Having a sufficient sample size is crucial when developing a clinical prediction model. We reviewed details of sample size in studies that developed prediction models for binary outcomes using machine learning (ML) methods in oncology, and compared the sample size used to develop each model with the minimum sample size required to develop a regression-based model (Nmin).

Methods

We searched the Medline (via OVID) database for studies, published in December 2022, that developed a prediction model using ML methods. We reviewed how sample size was justified. We calculated Nmin, the minimum sample size required to develop a regression-based model, and compared it with the sample size that was used to develop the models.
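The abstract does not give the formulas behind Nmin. As an illustration only, the sketch below assumes two closed-form criteria commonly used for binary-outcome regression models: a margin of error around the overall risk, and a target shrinkage factor to limit overfitting. The function names, the 0.05 margin, the 0.9 shrinkage target, and the example inputs are assumptions, not values taken from the review.

```python
import math


def n_for_overall_risk(prevalence: float, margin: float = 0.05) -> int:
    """Sample size so a 95% CI for the overall risk has half-width <= margin.

    Assumed criterion: n = (1.96 / margin)^2 * p * (1 - p).
    """
    return math.ceil((1.96 / margin) ** 2 * prevalence * (1 - prevalence))


def n_for_shrinkage(n_params: int, r2_cs: float, shrinkage: float = 0.9) -> int:
    """Sample size targeting an expected uniform shrinkage factor >= shrinkage.

    Assumed criterion: n = P / ((S - 1) * ln(1 - R2_cs / S)), where P is the
    number of candidate predictor parameters and R2_cs is the anticipated
    Cox-Snell R-squared of the model.
    """
    return math.ceil(n_params / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage)))


def n_min(prevalence: float, n_params: int, r2_cs: float,
          margin: float = 0.05, shrinkage: float = 0.9) -> int:
    """Largest of the two criteria: both must hold at the same time."""
    return max(
        n_for_overall_risk(prevalence, margin),
        n_for_shrinkage(n_params, r2_cs, shrinkage),
    )


# Hypothetical inputs: 50% outcome prevalence, 10 parameters, R2_cs of 0.1.
risk_only = n_for_overall_risk(0.5)
both = n_min(0.5, 10, 0.1)
print(risk_only, both)
```

With these hypothetical inputs the overfitting criterion is the binding one, which mirrors the distinction drawn in the Results between studies that met Nmin in full and studies that met only the overall-risk requirement.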

Results

Only one of the 36 included studies justified its sample size. We were able to calculate Nmin for 17 (47%) studies. Five of these 17 studies met Nmin, allowing them to estimate the overall risk precisely and to minimize overfitting. There was a median deficit of 302 participants with the event (n = 17; range: −21,331 to 2298) when developing the ML models. A further three of the 17 studies met only the sample size required to estimate the overall risk precisely.
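The range of the deficit includes negative values. A minimal sketch of one way to read this, assuming the deficit is the required number of participants with the event minus the observed number, so that a negative deficit means a surplus. The event counts below are hypothetical and are not data from the review.

```python
from statistics import median


def event_deficit(required_events: int, observed_events: int) -> int:
    """Assumed convention: positive = shortfall, negative = surplus."""
    return required_events - observed_events


# Hypothetical (required, observed) event counts for three studies.
studies = [(500, 200), (300, 450), (1000, 700)]

deficits = [event_deficit(req, obs) for req, obs in studies]
median_deficit = median(deficits)
n_meeting_requirement = sum(d <= 0 for d in deficits)

print(deficits, median_deficit, n_meeting_requirement)
```

Under this convention, a study counts as meeting the requirement when its deficit is zero or negative.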

Conclusion

Studies developing a prediction model using ML in oncology seldom justified their sample size, and sample sizes were often smaller than Nmin. As ML models almost certainly require a larger sample size than regression models, the deficit is likely larger. We recommend that researchers consider and report their sample size and, at a minimum, meet the sample size required when developing a regression-based model.
Source journal
Journal of Clinical Epidemiology (Medicine: Public, Environmental & Occupational Health)
CiteScore: 12.00
Self-citation rate: 6.90%
Articles published: 320
Review time: 44 days
About the journal: The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.