Prediction and feature selection of low birth weight using machine learning algorithms.

IF 2.4 | Zone 3 (Medicine) | JCR Q3, Environmental Sciences

Journal of Health, Population, and Nutrition | Pub Date: 2024-10-12 | DOI: 10.1186/s41043-024-00647-8

Tasneem Binte Reza, Nahid Salma

Volume 43, issue 1, page 157. Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471022/pdf/

Citations: 0

Abstract

Background and aims: A newborn's birth weight is a crucial determinant of overall health and future well-being. Low birth weight (LBW), which the World Health Organization defines as a birth weight below 2,500 g, is a widespread global problem. LBW can have severe consequences for an individual's health, including neonatal mortality and a range of health problems throughout life. To address this problem, this study used data from the 2017-2018 Bangladesh Demographic and Health Survey (BDHS) to identify important predictors of LBW with a variety of machine learning (ML) approaches, and to determine the best feature selection technique and the best predictive ML model.

Methods: The Boruta algorithm and a wrapper method were used to select the key features. Logistic Regression (LR) served as the traditional method, and several machine learning classifiers were then applied to determine the best model for predicting LBW: Decision Tree (DT), Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Adaptive Boosting (AdaBoost, AB). Model performance was evaluated using specificity, sensitivity, accuracy, F1 score, and AUC.
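
The abstract does not say which wrapper variant was used, so the following is only a minimal sketch of the general idea, assuming greedy forward selection: a feature is kept only if adding it improves a model's score on held-out data. The nearest-centroid classifier, the feature names (`is_twin`, `noise_a`, `noise_b`), the label name `lbw`, and the synthetic rows are all hypothetical stand-ins and are not taken from the study or from BDHS.

```python
def centroid(rows, features):
    """Mean value of each selected feature over the given rows."""
    return [sum(r[f] for r in rows) / len(rows) for f in features]


def accuracy_with(features, train, valid):
    """Fit a toy nearest-centroid classifier on `features`; return validation accuracy."""
    centroids = {
        label: centroid([r for r in train if r["lbw"] == label], features)
        for label in (0, 1)
    }

    def predict(row):
        def squared_distance(label):
            return sum((row[f] - c) ** 2 for f, c in zip(features, centroids[label]))
        return min((0, 1), key=squared_distance)

    hits = sum(predict(r) == r["lbw"] for r in valid)
    return hits / len(valid)


def forward_select(candidates, train, valid):
    """Greedy wrapper: add the feature that most improves the score, stop when none does."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(accuracy_with(selected + [f], train, valid), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best:  # no candidate improves the model: stop
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Synthetic, deterministic rows (hypothetical): every fourth birth is a twin,
# and in this toy data the label simply follows the twin flag.
rows = [
    {
        "is_twin": 1 if i % 4 == 0 else 0,
        "noise_a": (i * 7 % 10) / 10,
        "noise_b": (i * 3 % 5) / 5,
        "lbw": 1 if i % 4 == 0 else 0,
    }
    for i in range(40)
]
train, valid = rows[:28], rows[28:]

selected, best = forward_select(["noise_a", "is_twin", "noise_b"], train, valid)
print(selected, best)
```

Because the wrapper scores feature subsets with the model itself, each classifier can end up with a different subset, which is consistent with the different per-model feature sets reported in the results. In practice the score would usually come from cross-validation rather than a single split.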

Results: The Boruta algorithm identified eleven significant features: respondent's age, highest education level, educational attainment, wealth index, age at first birth, weight, height, BMI, age at first sexual intercourse, birth order number, and whether the child is a twin. Using these features, the traditional LR model and the ML models (DT, SVM, NB, RF, XGBoost, and AB) were evaluated. LR had a specificity of 0.85, a sensitivity of 0.5, an accuracy of 85.15%, and an F1 score of 0.915. For the ML models DT, SVM, NB, RF, XGBoost, and AB, the reported accuracy values were 85.35%, 85.15%, 84.54%, 81.18%, and 84.41%. Judged on specificity, sensitivity, accuracy, F1 score, and AUC, RF (specificity = 0.99, sensitivity = 0.58, accuracy = 85.86%, F1 score = 0.9243, AUC = 0.549) outperformed the other methods. The performance of both the classical (LR) and the ML models improved when features were selected with the wrapper method. With the wrapper method, LR identified five significant features and achieved a specificity of 0.87, a sensitivity of 0.33, an accuracy of 87.12%, and an F1 score of 0.9309. The DT and RF models, also fitted with the wrapper technique, identified three key features: region, whether the infant is a twin, and cesarean delivery. All three models had an identical F1 score of 0.9318. The SVM, NB, and AB models recognized "child is twin" as a significant feature, with an F1 score of 0.9315. Finally, the XGBoost model recognized "child is twin" and "age at first sex" as relevant features, also with an F1 score of 0.9315. Random Forest again outperformed the other approaches.
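
Specificity, sensitivity, accuracy, and F1 score can all be derived from the four cells of a binary confusion matrix. The sketch below uses invented counts (1,000 births, 150 of them LBW), not the study's data, and the function name `binary_metrics` is hypothetical. It scores the same predictions twice, once with LBW as the positive class and once with normal birth weight as the positive class, to show that accuracy stays the same while sensitivity, specificity, and F1 depend on that choice.

```python
def binary_metrics(tp, fp, tn, fn):
    """Threshold-based metrics for one choice of positive class."""
    sensitivity = tp / (tp + fn)   # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts: 150 LBW births (30 detected, 120 missed) and
# 850 normal-weight births (840 correctly classified, 10 flagged as LBW).
lbw_positive = binary_metrics(tp=30, fp=10, tn=840, fn=120)

# The same predictions, scored with normal birth weight as the positive class.
normal_positive = binary_metrics(tp=840, fp=120, tn=30, fn=10)

for name, metrics in (("LBW positive", lbw_positive),
                      ("normal positive", normal_positive)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

On imbalanced data such as this, a model can reach high accuracy while missing most cases of the minority class, so sensitivity and AUC are worth reading alongside accuracy and F1.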

Conclusions: The study identifies the wrapper method as the best-performing feature selection technique. The ML methods outperformed the traditional method, with Random Forest (RF) the most effective model for predicting low birth weight. The findings suggest that policymakers in Bangladesh can reduce the incidence of low birth weight by addressing the identified risk factors.

Source journal

Journal of Health, Population, and Nutrition (Medicine: Public, Environmental and Occupational Health)

CiteScore

2.20

Self-citation rate

0.00%

Articles published

Review time

6 months

About the journal: Journal of Health, Population and Nutrition brings together research on all aspects of population, nutrition, and health. The journal publishes articles across a broad range of topics, including global health, maternal and child health, nutrition, common illnesses, and determinants of population health.