Biruk Tsegaye, Kym I.E. Snell, Lucinda Archer, Shona Kirtley, Richard D. Riley, Matthew Sperrin, Ben Van Calster, Gary S. Collins, Paula Dhiman
Journal of Clinical Epidemiology, volume 180, Article 111675. Published 2025-01-13. DOI: 10.1016/j.jclinepi.2025.111675. https://www.sciencedirect.com/science/article/pii/S0895435625000083
Larger sample sizes are needed when developing a clinical prediction model using machine learning in oncology: methodological systematic review
Background and Objectives
Having a sufficient sample size is crucial when developing a clinical prediction model. We reviewed how sample size was handled in studies that developed prediction models for binary outcomes using machine learning (ML) methods within oncology, and compared the sample size used to develop each model with the minimum sample size required when developing a regression-based model (Nmin).
Methods
We searched the Medline (via OVID) database for studies developing a prediction model using ML methods published in December 2022. We reviewed how sample size was justified. We calculated Nmin, the minimum sample size required when developing a regression-based model, and compared this with the sample size that was used to develop the models.
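The abstract does not give the formulas behind Nmin. As a minimal sketch, assuming the closed-form criteria commonly used for regression-based models with a binary outcome, the two aims named in the Results (precise estimation of the overall risk, and limited overfitting) could be computed as follows. The function names, the 0.05 margin of error, the 0.9 target shrinkage factor, and the use of an anticipated Cox-Snell R² are assumptions for illustration, not details taken from the review.

```python
import math


def n_for_overall_risk(prevalence, margin=0.05, z=1.96):
    """Participants needed so that a 95% confidence interval for the
    overall outcome risk has half-width <= margin (assumed default 0.05)."""
    return math.ceil((z / margin) ** 2 * prevalence * (1 - prevalence))


def n_for_shrinkage(n_params, r2_cs, shrinkage=0.9):
    """Participants needed for an expected uniform shrinkage factor of at
    least `shrinkage` (assumed default 0.9), given the number of candidate
    predictor parameters and an anticipated Cox-Snell R-squared."""
    return math.ceil(
        n_params / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))
    )


def n_min(prevalence, n_params, r2_cs):
    """Hypothetical Nmin: the larger of the two criteria above."""
    return max(
        n_for_overall_risk(prevalence),
        n_for_shrinkage(n_params, r2_cs),
    )


if __name__ == "__main__":
    # Illustrative inputs only; not values from any reviewed study.
    print(n_min(prevalence=0.5, n_params=10, r2_cs=0.1))
```

Taking the maximum reflects the idea that a development dataset should satisfy every criterion at once. A study can meet the overall-risk criterion alone while still falling short on overfitting, which is the situation the Results describe for three of the studies.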
Results
Only one of the 36 included studies justified its sample size. We were able to calculate Nmin for 17 (47%) studies. Five of these 17 studies met Nmin, allowing precise estimation of the overall risk and minimizing overfitting. There was a median deficit of 302 participants with the event (n = 17; range: −21,331 to 2298) when developing the ML models. An additional three of the 17 studies met the sample size required to precisely estimate the overall risk only.
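The deficit summary above can be illustrated with a short sketch. It assumes the deficit is the required number minus the number actually used, so that a positive value is a shortfall and a negative value is a surplus; this sign convention is inferred from the reported range and is not stated in the abstract. The study values below are made up for illustration and are not data from the review.

```python
from statistics import median


def deficits(required, actual):
    """Per-study deficit: required minus actual.
    Positive = shortfall, negative = surplus (assumed convention)."""
    return [r - a for r, a in zip(required, actual)]


def summarize(required, actual):
    """Return (median deficit, minimum, maximum, number of studies meeting
    the requirement) for paired lists of required and actual sizes."""
    d = deficits(required, actual)
    n_met = sum(1 for x in d if x <= 0)
    return median(d), min(d), max(d), n_met


if __name__ == "__main__":
    # Hypothetical values for five studies.
    required = [500, 800, 1200, 300, 950]
    actual = [200, 900, 700, 300, 150]
    print(summarize(required, actual))
```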
Conclusion
Studies developing a prediction model using ML in oncology seldom justified their sample size, and sample sizes were often smaller than Nmin. As ML models almost certainly require a larger sample size than regression models, the true deficit is likely larger. We recommend that researchers consider and report their sample size and at least meet the minimum sample size required when developing a regression-based model.
About the journal:
The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.