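The Application of Machine Learning Algorithms to Predict HIV Testing Using Evidence from the 2002-2017 South African Adult Population-Based Surveys: An HIV Testing Predictive Model.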

IF 2.8 | Zone 4 (Medicine) | Q2 INFECTIOUS DISEASES

Tropical Medicine and Infectious Disease | Pub Date: 2025-06-14 | DOI: 10.3390/tropicalmed10060167

Musa Jaiteh, Edith Phalane, Yegnanew A Shiferaw, Haruna Jallow, Refilwe Nancy Phaswana-Mafuya


Citations: 0

Abstract



The Application of Machine Learning Algorithms to Predict HIV Testing Using Evidence from the 2002-2017 South African Adult Population-Based Surveys: An HIV Testing Predictive Model.

A significant portion of the South African population has an unknown HIV status, which slows epidemic control despite the progress made in HIV testing. Machine learning (ML) has been effective in identifying individuals at higher risk of HIV infection, for whom testing is strongly recommended. However, there are too few predictive models to inform targeted HIV testing interventions in South Africa. Using supervised ML (SML) algorithms, this study aimed to identify the most consistent predictors of HIV testing in repeated adult population-based surveys in South Africa. The study applied four SML algorithms (decision trees, random forest, support vector machines (SVM), and logistic regression) to the five cross-sectional cycles of the South African National HIV Prevalence, Incidence, Behavior and Communication Survey (SABSSM) datasets. The Human Sciences Research Council (HSRC) conducted the SABSSM surveys and made the datasets available for this study. Each dataset was split into 80% training and 20% testing sets, with 5-fold cross-validation. The random forest outperformed the other models across all five datasets, with the highest accuracy (80.98%), precision (81.51%), F₁-score (80.30%), area under the curve (AUC) (88.31%), and cross-validation average (79.10%) in the 2002 data. It achieved the highest classification performance in every survey year, especially in the 2017 survey. SVM had high recall (89.12% in 2005, 86.28% in 2008) but lower precision, leading to a suboptimal F₁-score in the initial analysis. We applied a soft margin to the SVM to improve its classification robustness and generalization, but its accuracy and precision remained low in most surveys, increasing the chances of misclassifying individuals who tested for HIV. Logistic regression performed well, with an accuracy of 72.75%, a precision of 73.64%, and an AUC of 81.41% in 2002, and an F₁-score of 73.83% in 2017, but its performance was somewhat lower than that of the random forest. Decision trees demonstrated moderate accuracy (73.80% in 2002) but were prone to overfitting. The most consistent predictors of HIV testing are knowledge of HIV testing sites, being female, being a younger adult, having high socioeconomic status, and being well-informed about HIV through digital platforms. Random forest's ability to analyze complex datasets makes it a valuable tool for informing data-driven policy initiatives, such as raising awareness, engaging the media, improving employment outcomes, enhancing accessibility, and targeting high-risk individuals. By addressing the identified gaps in the existing healthcare framework, South Africa can enhance the efficacy of HIV testing and progress towards achieving the UNAIDS 2030 goal of eradicating AIDS.
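The abstract notes that the SVM's high recall, paired with lower precision, led to a suboptimal F₁-score. A minimal sketch of how those metrics relate is given below. The confusion-matrix counts are hypothetical, chosen only to illustrate the effect, and are not taken from the study; the names `Confusion` and `metrics` are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion-matrix counts; the positive class is 'tested for HIV'."""
    tp: int  # tested, predicted tested
    fp: int  # not tested, predicted tested
    fn: int  # tested, predicted not tested
    tn: int  # not tested, predicted not tested


def metrics(c: Confusion) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from the counts."""
    total = c.tp + c.fp + c.fn + c.tn
    predicted_pos = c.tp + c.fp
    actual_pos = c.tp + c.fn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / predicted_pos if predicted_pos else 0.0
    recall = c.tp / actual_pos if actual_pos else 0.0
    # F1 is the harmonic mean of precision and recall, so it is pulled
    # towards whichever of the two is lower.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (NOT from the study): a model that labels most
# people as "tested" versus a more balanced model.
high_recall = Confusion(tp=89, fp=60, fn=11, tn=40)
balanced = Confusion(tp=80, fp=18, fn=20, tn=82)

for name, cm in [("high-recall", high_recall), ("balanced", balanced)]:
    m = metrics(cm)
    print(name, {k: round(v, 4) for k, v in m.items()})
```

In this made-up case the high-recall model catches more of the people who tested, yet its many false positives lower precision enough that its F₁-score and accuracy both fall below those of the balanced model.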


Source journal