Optimal use of HIV testing resources accelerates progress towards ending HIV as a global threat. In Kenya, current testing practices yield a 2.8% positivity rate for new diagnoses reported through the national HIV electronic medical record (EMR) system. Researchers have increasingly explored the potential of machine learning to identify people with undiagnosed HIV and refer them for testing. However, few studies have used routinely collected programme data as the basis for a real-time clinical decision support system to improve HIV screening. In this study, we applied machine learning to routine programme data from Kenya's EMR to predict the probability that an individual seeking care has undiagnosed HIV and should be prioritized for testing.
We used de-identified individual-level EMR data from 167,509 individuals without a previous HIV diagnosis who were tested between June and November 2022. Predictors included demographics, clinical histories and HIV-relevant behavioural practices from the EMR, combined with open-source data describing population-level behavioural practices. To address high rates of missing data, we compared several imputation techniques and selected the optimal one based on out-of-sample error. We generated a stratified 60-20-20 train-validate-test split to assess model generalizability. We trained four machine learning algorithms: logistic regression, Random Forest, AdaBoost and XGBoost. Models were evaluated using the Area Under the Precision-Recall Curve (AUCPR), a metric well suited to class imbalance such as this, in which negative test results far outnumber positive ones.
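The split, model comparison and AUCPR evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline. It assumes synthetic data with about 3% positives in place of the EMR records, median imputation as a stand-in for whichever technique the authors selected, and scikit-learn's `average_precision_score` as one common estimator of AUCPR. XGBoost is omitted only to keep the sketch within scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the EMR data: roughly 3% positives (assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.97], random_state=0,
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan  # inject ~10% missing values

# Stratified 60-20-20 split: hold out 40%, then split that half and half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}

# Compare models on the validation set; the test set stays untouched
# until a final model has been chosen.
val_aucpr = {}
for name, model in models.items():
    # The imputer is fitted on training data only, inside the pipeline.
    pipe = make_pipeline(SimpleImputer(strategy="median"), model)
    pipe.fit(X_train, y_train)
    scores = pipe.predict_proba(X_val)[:, 1]
    val_aucpr[name] = average_precision_score(y_val, scores)

baseline = y_val.mean()  # positivity rate in the validation set
for name, ap in val_aucpr.items():
    print(f"{name}: AUCPR={ap:.3f} ({ap / baseline:.1f}x positivity rate)")
```

Stratifying both splits keeps the share of positive results nearly identical across the three sets, which matters when positives are this rare.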
All four model types achieved an AUCPR on the test set that exceeded the current positivity rate. XGBoost achieved the highest AUCPR, 10.5 times greater than the rate of positive test results.
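Comparing AUCPR with the positivity rate is meaningful because a classifier whose scores are unrelated to the outcome has an expected AUCPR approximately equal to the share of positives. The sketch below illustrates this on simulated labels. The sample size and the helper `lift` are hypothetical, and the 2.8% prevalence is borrowed from the national figure quoted earlier purely for illustration. The absolute AUCPR values from the study are not reported here, so none are reproduced.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 200_000
prevalence = 0.028  # illustrative positivity rate (assumption)

# Simulated test outcomes and scores that carry no information.
y = (rng.random(n) < prevalence).astype(int)
random_scores = rng.random(n)

# For uninformative scores, AUCPR sits close to the positivity rate.
baseline_aucpr = average_precision_score(y, random_scores)
print(f"positivity rate: {y.mean():.4f}")
print(f"AUCPR of random scores: {baseline_aucpr:.4f}")


def lift(aucpr, positivity_rate):
    """Ratio of a model's AUCPR to the positivity-rate baseline."""
    return aucpr / positivity_rate
```

A ratio well above 1 from `lift` indicates that ranking individuals by model score concentrates positive results far more than testing without prioritization would.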
Our study demonstrated that machine learning applied to routine HIV testing data may be used as a clinical decision support tool to refer persons for HIV testing. The resulting model could be integrated into the screening form of an EMR and used in real time to inform testing decisions. Although issues of data quality and missing data remained, these challenges could be addressed using sound data preparation techniques.