A Machine Learning Algorithm With an Oversampling Technique in Limited Data Scenarios for the Prediction of Present and Future Restorative Treatment Need: Development and Validation Study.

IF 3.8; Zone 3 (Medicine); JCR Q2, Medical Informatics

JMIR Medical Informatics; Pub Date: 2025-08-28; DOI: 10.2196/75117

Elina Väyrynen, Otso Tirkkonen, Henna Tiensuu, Jaakko Suutala, Vuokko Anttonen, Marja-Liisa Laitala, Katri Kukkola, Saujanya Karki


Citations: 0

Abstract

Background: Untreated dental caries is the most common health condition worldwide. Therefore, new strategies need to be developed to reduce the manifestations of dental caries.

Objective: This study aimed to develop and test a machine learning (ML) algorithm for detecting present and predicting future carious lesions in the adolescent population using a set of easy-to-collect predictive variables. In addition, this study aimed to address the small, imbalanced dataset using an oversampling method.

Methods: This population-based study was conducted in 2022 among secondary schoolchildren aged 13 to 17 years from the northern parts of Finland. Participants were eligible if they had completed a web-based risk assessment questionnaire and had undergone a clinical examination at public health care services; a total of 218 participants met these inclusion criteria. The outcomes were dental caries (International Caries Detection and Assessment System [ICDAS] scores of 4, 5, and 6; ie, ICDAS 4-6) and active initial caries (ICDAS 2+, 3+). Several predictors, such as behavioral and dietary habits, were included. An extreme gradient boosting (XGBoost) model was developed, tested, and assessed for its predictive performance. A 4-fold cross-validation was performed using the nested resampling technique. The random oversampling examples (ROSE) method and the k-nearest neighbors (KNN) classifiers were used for all 4 folds. The mean (SD) performance across the folds was computed.
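The resampling step described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only the Python standard library, synthetic 2-feature data, a single (outer) 4-fold loop rather than full nested resampling, and a simple "resample with Gaussian jitter" routine that only loosely mirrors the smoothed-bootstrap idea behind the random oversampling examples method. The function names, the `jitter` parameter, and the toy features are all hypothetical; only the class counts (143 of 218) come from the abstract. The point it shows is that oversampling is applied to the training folds only, so the held-out fold keeps its original class balance.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split row indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def oversample_minority(rows, labels, rng, jitter=0.0):
    """Add resampled copies of smaller classes until all classes are equal in size.

    jitter > 0 perturbs each copy with Gaussian noise (an assumption made
    here to hint at a smoothed bootstrap; jitter=0 is plain duplication).
    """
    by_class = defaultdict(list)
    for x, y in zip(rows, labels):
        by_class[y].append(x)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = list(rows), list(labels)
    for y, members in by_class.items():
        for _ in range(target - len(members)):
            base = rng.choice(members)
            out_rows.append([v + rng.gauss(0.0, jitter) for v in base])
            out_labels.append(y)
    return out_rows, out_labels


rng = random.Random(0)
labels = [1] * 143 + [0] * 75  # class counts taken from the abstract (143/218)
rows = [[rng.gauss(y, 1.0), rng.gauss(0.0, 1.0)] for y in labels]  # synthetic features

folds = stratified_folds(labels, k=4, rng=rng)
summary = []
for i, test_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    X_train = [rows[j] for j in train_idx]
    y_train = [labels[j] for j in train_idx]
    # Balance the training data only; the test fold is left untouched.
    X_bal, y_bal = oversample_minority(X_train, y_train, rng, jitter=0.1)
    summary.append((len(test_idx), y_train.count(0), y_train.count(1),
                    y_bal.count(0), y_bal.count(1)))
    # A classifier would be fitted on (X_bal, y_bal) and scored on test_idx here.

for fold_no, row in enumerate(summary, start=1):
    print(fold_no, row)
```

In a full nested design, an inner cross-validation loop for hyperparameter tuning would run inside each training split, with the oversampling repeated inside that inner loop as well.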

Results: Dental caries (ICDAS 2+, 3+, 4-6) was present in 65.6% (143/218) of the participants. The mean area under the curve (AUC) was 0.77 (SD 0.04) and the mean F₁-score was 0.82 (SD 0.06) for the extreme gradient boosting model. After oversampling, the mean AUC and mean F₁-score were 0.74 (SD 0.05) and 0.79 (SD 0.04), respectively. Shapley additive explanation (SHAP) values were calculated for all 4 folds to assess feature importance; previous dental fillings were the feature most strongly associated with the need for restorative treatment.
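For readers less familiar with the reported metrics, the sketch below shows how prevalence, the area under the curve, the F₁-score, and a mean (SD) across folds can be computed with the Python standard library. Only the prevalence counts (143/218) come from the abstract. The toy labels, scores, the 0.5 decision threshold, and the per-fold values in `fold_auc` are hypothetical and are not the study's data; the use of the sample SD (n-1 denominator) is also an assumption, since the abstract does not state which form was used.

```python
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Prevalence as reported in the abstract: 143 of 218 participants.
prevalence = 143 / 218
print(f"prevalence: {prevalence:.1%}")

# Toy fold (hypothetical values, for illustration only).
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
print(f"AUC: {roc_auc(y_true, scores):.3f}, F1: {f1_score(y_true, y_pred):.3f}")

# Mean (SD) across 4 folds; these per-fold AUCs are hypothetical.
fold_auc = [0.70, 0.75, 0.80, 0.75]
print(f"mean AUC {mean(fold_auc):.2f} (SD {stdev(fold_auc):.2f})")
```

With only 4 folds, the SD is estimated from 4 values, so it gives a rough indication of fold-to-fold variability rather than a precise one.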

Conclusions: On the basis of the performance metrics, the ML algorithm developed and tested in this study can be considered good. The algorithm could serve as a cost-effective screening tool for dental professionals to identify the risk of future restorative treatment needs. However, future studies with longitudinal cohort data, along with external validation for generalizability, are needed to validate our model.


Source journal

JMIR Medical Informatics (Medicine, Health Informatics)

CiteScore: 7.90

Self-citation rate: 3.10%

Articles published: 173

Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, industry, and health informatics professionals. It is published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175). JMIR Med Inform has a slightly different scope (placing more emphasis on applications for clinicians and health professionals than on consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers that are more technical or more formative than what would be published in the Journal of Medical Internet Research.