Predicting HIV testing status in pregnant women using balanced machine learning models: Insights from Sierra Leone's demographic health survey

Decoding Infection and Transmission, Volume 4, Article 100078. Pub Date: 2026-01-01. Epub Date: 2026-02-24. DOI: 10.1016/j.dcit.2026.100078

Afeez A. Soladoye, David B. Olawade, Oluwakemi Jumoke Bello, Claret Chinenyenwa Analikwu, Raphael Igbarumah Ayo Daniel, Augustus Osborne



Abstract

Objective

Preventing vertical HIV transmission requires comprehensive testing programmes for pregnant women, yet coverage gaps persist across Sub-Saharan Africa. In Sierra Leone, approximately one-third of pregnant women remain untested for HIV, creating substantial public health challenges. Conventional predictive models are often biased towards the majority class in imbalanced datasets, which hinders accurate identification of the untested women who require urgent intervention. This study develops and validates diagnostic machine learning prediction models that identify pregnant women in Sierra Leone at risk of not being tested for HIV, emphasising class-balancing techniques to improve detection of the minority class and to support targeted intervention strategies.
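The majority-class bias described above can be illustrated with a small sketch. The labels below are hypothetical toy data that mirror only the roughly one-third untested share quoted in the abstract; they are not the survey data, and the helper names `f1_for_class` and `macro_f1` are introduced here for illustration. A model that always predicts "tested" looks acceptable on accuracy but scores poorly on macro F1, because it never finds an untested woman.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical toy labels: two-thirds tested, one-third untested.
y_true = ["tested"] * 20 + ["untested"] * 10
# A degenerate model that always predicts the majority class.
y_pred = ["tested"] * 30

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, macro F1={score:.3f}")
```

Here the "tested" class reaches an F1 of 0.8 while the "untested" class scores 0, so the macro average is 0.4 even though accuracy is about 0.667.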

Methods

We analysed data on 990 pregnant women aged 15-49 from the 2019 Sierra Leone Demographic and Health Survey. Preprocessing comprised categorical variable encoding, feature normalisation via Min-Max scaling, and the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset. Four supervised learning algorithms were trained: Random Forest, XGBoost, Logistic Regression, and K-Nearest Neighbors. Model performance was evaluated on a 70-30 train-test split using accuracy and macro-averaged precision, recall, and F1-score.
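The preprocessing steps named above can be sketched from scratch as follows. This is an illustrative standard-library version, not the authors' code: the toy features, the helper names (`min_max_scale`, `stratified_split`, `smote`), the stratified form of the 70-30 split, and the choice to oversample only the training portion are all assumptions, since the abstract does not specify them. The `smote` function shows the core idea of the technique, which is to create synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def min_max_scale(rows):
    """Rescale every feature column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


def stratified_split(X, y, train_pct=70):
    """Put the first train_pct percent of each class in the training set."""
    train, test = [], []
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cut = len(rows) * train_pct // 100
        train += [(x, label) for x in rows[:cut]]
        test += [(x, label) for x in rows[cut:]]
    return train, test


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: two numeric features; 1 = tested, 0 = untested.
rng = random.Random(42)
X = [[rng.uniform(15, 49), rng.uniform(0, 10)] for _ in range(30)]
y = [1] * 20 + [0] * 10

train, test = stratified_split(min_max_scale(X), y)
train_minority = [x for x, label in train if label == 0]
n_majority = sum(1 for _, label in train if label == 1)

# Oversample the minority class until both classes are the same size.
synthetic = smote(train_minority, n_majority - len(train_minority), rng=rng)
balanced_train = train + [(x, 0) for x in synthetic]
print(len(train), len(test), len(balanced_train))
```

A classifier would then be fitted on `balanced_train` and scored on the untouched `test` rows. In practice, established libraries such as imbalanced-learn and scikit-learn provide maintained versions of these steps; the sketch only makes the mechanics visible.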

Results

Models trained on the imbalanced dataset performed poorly, with macro F1-scores between 0.46 and 0.57. After SMOTE was applied, macro F1-scores rose to between 0.55 and 0.72. Random Forest achieved the highest macro F1-score (0.72), representing a 56% improvement over standard approaches.
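The abstract does not state which baseline the 56% figure refers to. As a rough check under that caveat, the sketch below computes the relative improvement of 0.72 over both ends of the reported imbalanced range; the function name `relative_improvement` is introduced here for illustration. The result is consistent with a comparison against the lowest imbalanced score (0.46), although this reading is an assumption.

```python
def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100


best_balanced = 0.72
vs_lowest = relative_improvement(0.46, best_balanced)   # about 56.5
vs_highest = relative_improvement(0.57, best_balanced)  # about 26.3
print(f"vs 0.46: {vs_lowest:.1f}%, vs 0.57: {vs_highest:.1f}%")
```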

Conclusions

Class imbalance mitigation through SMOTE substantially enhances diagnostic prediction model performance for HIV testing status classification, facilitating targeted public health strategies in resource-constrained environments.


