Hongyu Lai, Kaiye Gao, Meiyan Li, Tao Li, Xiaodong Zhou, Xingtao Zhou, Hui Guo, Bo Fu
{"title":"处理早发近视风险预测模型的数据缺失和测量误差。","authors":"Hongyu Lai, Kaiye Gao, Meiyan Li, Tao Li, Xiaodong Zhou, Xingtao Zhou, Hui Guo, Bo Fu","doi":"10.1186/s12874-024-02319-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction.</p><p><strong>Methods: </strong>We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and Xgboost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).</p><p><strong>Results: </strong>Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanisms. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes, and MI-ME becomes more superior. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models.</p><p><strong>Conclusion: </strong>MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":null,"pages":null},"PeriodicalIF":3.9000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11378546/pdf/","citationCount":"0","resultStr":"{\"title\":\"Handling missing data and measurement error for early-onset myopia risk prediction models.\",\"authors\":\"Hongyu Lai, Kaiye Gao, Meiyan Li, Tao Li, Xiaodong Zhou, Xingtao Zhou, Hui Guo, Bo Fu\",\"doi\":\"10.1186/s12874-024-02319-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction.</p><p><strong>Methods: </strong>We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and Xgboost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).</p><p><strong>Results: </strong>Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanisms. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes, and MI-ME becomes more superior. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models.</p><p><strong>Conclusion: </strong>MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.</p>\",\"PeriodicalId\":9114,\"journal\":{\"name\":\"BMC Medical Research Methodology\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2024-09-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11378546/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Research Methodology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12874-024-02319-x\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Research Methodology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12874-024-02319-x","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0
摘要
背景:早期识别近视高风险儿童对于通过及时干预预防近视发展至关重要。然而,数据缺失和测量误差(ME)是风险预测建模中常见的难题,会给近视预测带来偏差:方法:我们探讨了四种解决数据缺失和测量误差的估算方法:单一估算(SI)、随机缺失下的多重估算(MI-MAR)、带校准程序的多重估算(MI-ME)和非随机缺失下的多重估算(MI-MNAR)。我们比较了四种机器学习模型(决策树、奈夫贝叶斯、随机森林和 Xgboost)和三种统计模型(逻辑回归、逐步逻辑回归和最小绝对收缩与选择算子逻辑回归)在近视风险预测中的应用。我们将这些模型应用于上海金山近视队列研究,并进行了模拟研究,以调查缺失机制、ME程度和预测因子的重要性对模型性能的影响。模型性能使用接收者操作特征曲线(AUROC)和精确度-召回曲线下面积(AUPRC)进行评估:结果:我们的研究结果表明,在有缺失数据和 ME 的情况下,使用 MI-ME 结合逻辑回归可获得最佳预测结果。在没有 ME 的情况下,无论缺失机制如何,使用 MI-MAR 处理缺失数据的效果都优于 SI。当 ME 对预测的影响大于缺失数据时,MI-MAR 的相对优势就会减弱,MI-ME 就会变得更加优越。此外,我们的结果表明,统计模型比机器学习模型表现出更好的预测性能:结论:MI-ME 是一种可靠的方法,可以处理早期近视风险预测中重要预测指标的缺失数据和 ME。
Handling missing data and measurement error for early-onset myopia risk prediction models.
Background: Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction.
Methods: We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and Xgboost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
Results: Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanisms. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes, and MI-ME becomes more superior. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models.
Conclusion: MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.
期刊介绍:
BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.