Modeling the determinants of attrition in a two-stage epilepsy prevalence survey in Nairobi using machine learning

Global Epidemiology. Pub Date: 2025-01-06. DOI: 10.1016/j.gloepi.2025.100183

Daniel M. Mwanga, Isaac C. Kipchirchir, George O. Muhua, Charles R. Newton, Damazo T. Kadengye

Volume 9, Article 100183. Full text: https://www.sciencedirect.com/science/article/pii/S259011332500001X


Abstract

Background

Attrition is a challenge for parameter estimation in both longitudinal and multi-stage cross-sectional studies. Here, we examine the utility of machine learning for predicting attrition and identifying associated factors in a two-stage, population-based epilepsy prevalence study in Nairobi.

Methods

All individuals in the Nairobi Urban Health and Demographic Surveillance System (NUHDSS) (Korogocho and Viwandani) were screened for epilepsy in two stages. Attrition was defined as a probable epilepsy case identified at stage-I who did not attend stage-II (neurologist assessment). Categorical variables were one-hot encoded, class imbalance was addressed using the synthetic minority over-sampling technique (SMOTE), and numeric variables were centered and scaled. The dataset was split into training and testing sets (7:3 ratio), and seven machine learning models, including the ensemble Super Learner, were trained. Hyperparameters were tuned using 10-fold cross-validation, and model performance was evaluated using metrics including area under the curve (AUC), accuracy, Brier score, and F1 score over 500 bootstrap samples of the test data.
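The splitting and oversampling steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' pipeline: the toy data, the function names `train_test_split` and `smote`, the choice of `k=5` neighbours, and the decision to oversample only the training portion are all assumptions, since the abstract does not give those details.

```python
import math
import random

def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle the indices and split them 7:3 into training and testing sets."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical toy data: 40 attenders (label 0) and 10 attrition cases
# (label 1), each described by two numeric features.
toy_rng = random.Random(42)
X = ([[toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)] for _ in range(40)]
     + [[toy_rng.gauss(2, 1), toy_rng.gauss(2, 1)] for _ in range(10)])
y = [0] * 40 + [1] * 10

X_train, y_train, X_test, y_test = train_test_split(X, y)

# Assumption: oversample the minority class in the training portion only,
# adding enough synthetic points to match the majority class count.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
majority_count = sum(1 for label in y_train if label == 0)
new_points = smote(minority, n_new=majority_count - len(minority))
X_balanced = X_train + new_points
y_balanced = y_train + [1] * len(new_points)
```

Because every synthetic point lies on a segment between two observed minority cases, SMOTE adds no information beyond the observed minority region; it only rebalances the classes the models are trained on.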

Results

Random forest (AUC = 0.98, accuracy = 0.95, Brier score = 0.06, F1 = 0.94), extreme gradient boosting (XGB) (AUC = 0.96, accuracy = 0.91, Brier score = 0.08, F1 = 0.90), and support vector machine (SVM) (AUC = 0.93, accuracy = 0.93, Brier score = 0.07, F1 = 0.92) were the best-performing base learners. The ensemble Super Learner performed similarly well. Important predictors of attrition included proximity to industrial areas, male gender, employment, education, smaller households, and a history of complex partial seizures.
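For readers less familiar with the four reported metrics, the following plain-Python sketch shows how each can be computed from predicted probabilities and then summarized over 500 bootstrap resamples of a test set. It is a hedged illustration only: the toy labels and probabilities, the 0.5 classification threshold, the function names, and the approximate percentile interval are assumptions, not details taken from the study.

```python
import random

def auc(y_true, p):
    """Probability that a random positive is scored above a random negative
    (ties count one half)."""
    pos = [s for s, t in zip(p, y_true) if t == 1]
    neg = [s for s, t in zip(p, y_true) if t == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, p, threshold=0.5):
    """Share of cases whose thresholded prediction matches the label."""
    return sum((s >= threshold) == (t == 1) for s, t in zip(p, y_true)) / len(y_true)

def brier(y_true, p):
    """Mean squared difference between predicted probability and outcome;
    lower is better."""
    return sum((s - t) ** 2 for s, t in zip(p, y_true)) / len(y_true)

def f1(y_true, p, threshold=0.5):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(s >= threshold and t == 1 for s, t in zip(p, y_true))
    fp = sum(s >= threshold and t == 0 for s, t in zip(p, y_true))
    fn = sum(s < threshold and t == 1 for s, t in zip(p, y_true))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap(y_true, p, metric, n_boot=500, seed=0):
    """Resample the test set with replacement n_boot times and return the
    mean metric with an approximate 95% percentile interval."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:  # AUC is undefined with one class; redraw
            continue
        values.append(metric(ys, [p[i] for i in idx]))
    values.sort()
    mean = sum(values) / n_boot
    return mean, values[int(0.025 * n_boot)], values[int(0.975 * n_boot) - 1]

# Hypothetical test labels (1 = attrition) and predicted probabilities.
y_test = [0, 0, 0, 1, 1, 1, 0, 1]
p_test = [0.1, 0.3, 0.6, 0.8, 0.7, 0.4, 0.2, 0.9]

summary = {m.__name__: bootstrap(y_test, p_test, m)
           for m in (auc, accuracy, brier, f1)}
```

Each entry of `summary` holds a (mean, lower, upper) triple. AUC and the Brier score use the predicted probabilities directly, whereas accuracy and F1 depend on the chosen threshold.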

Conclusion

These findings can help researchers plan targeted mobilization for scheduled clinical appointments to improve follow-up rates. They will also inform the development of a web-based algorithm to predict attrition risk and support targeted follow-up efforts in similar studies.


