Modeling the determinants of attrition in a two-stage epilepsy prevalence survey in Nairobi using machine learning

Daniel M. Mwanga , Isaac C. Kipchirchir , George O. Muhua , Charles R. Newton , Damazo T. Kadengye
Global Epidemiology, volume 9, Article 100183. Published 2025-01-06. DOI: 10.1016/j.gloepi.2025.100183
https://www.sciencedirect.com/science/article/pii/S259011332500001X

Abstract

Background

Attrition is a challenge for parameter estimation in both longitudinal and multi-stage cross-sectional studies. Here, we examine the utility of machine learning for predicting attrition and identifying associated factors in a two-stage, population-based epilepsy prevalence study in Nairobi.

Methods

All individuals in the Nairobi Urban Health and Demographic Surveillance System (NUHDSS) (Korogocho and Viwandani) were screened for epilepsy in two stages. Attrition was defined as the failure of a probable epilepsy case identified at stage-I to attend stage-II (neurologist assessment). Categorical variables were one-hot encoded, class imbalance was addressed using the synthetic minority over-sampling technique (SMOTE), and numeric variables were scaled and centered. The dataset was split into training and testing sets (7:3 ratio), and seven machine learning models, including the ensemble Super Learner, were trained. Hyperparameters were tuned using 10-fold cross-validation, and model performance was evaluated using metrics including area under the curve (AUC), accuracy, Brier score, and F1 score over 500 bootstrap samples of the test data.
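The preprocessing steps named above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code: the toy columns (`sex`, `age`, `attrited`), the helper names, the neighbour count `k`, and the decision to oversample only the training portion are all assumptions made for illustration, and the abstract does not state the order in which the steps were applied. The `smote` function is a simplified SMOTE-style interpolation for a binary outcome; applied to one-hot columns it can produce fractional values, which production implementations handle separately.

```python
import random
from statistics import mean, pstdev

def one_hot(values):
    """Map a categorical column to 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    return [[1.0 if v == level else 0.0 for level in levels] for v in values]

def standardize(column):
    """Center a numeric column at 0 and scale it to unit standard deviation."""
    mu, sd = mean(column), pstdev(column)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in column]

def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle row indices and split them into training and testing sets (7:3 by default)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def smote(rows, labels, minority=1, k=3, seed=0):
    """SMOTE-style oversampling: add interpolated minority rows until both classes match."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    if len(minority_rows) < 2:  # interpolation needs at least two minority points
        return list(rows), list(labels)
    n_needed = (len(rows) - len(minority_rows)) - len(minority_rows)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_rows)
        # k nearest minority neighbours of the base point (squared Euclidean distance)
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return rows + synthetic, labels + [minority] * len(synthetic)

# Toy data (hypothetical, not from the study).
sex = ["M", "F", "F", "M", "F", "M", "F", "F", "M", "F"]
age = [34, 51, 27, 45, 62, 19, 38, 55, 41, 30]
attrited = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

features = [enc + [z] for enc, z in zip(one_hot(sex), standardize(age))]
X_train, y_train, X_test, y_test = train_test_split(features, attrited)
X_bal, y_bal = smote(X_train, y_train)
```

Oversampling after the split keeps synthetic rows out of the test set; whether the study did the same is not stated in the abstract.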

Results

Random forest (AUC = 0.98, accuracy = 0.95, Brier score = 0.06, F1 = 0.94), extreme gradient boosting (XGB) (AUC = 0.96, accuracy = 0.91, Brier score = 0.08, F1 = 0.90), and support vector machine (SVM) (AUC = 0.93, accuracy = 0.93, Brier score = 0.07, F1 = 0.92) were the best-performing base learners. The ensemble Super Learner performed similarly well. Important predictors of attrition included proximity to industrial areas, male gender, employment, education, smaller households, and a history of complex partial seizures.
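For readers less familiar with the four reported metrics, the following standard-library Python sketch shows one way each can be computed from predicted probabilities and then summarized over bootstrap resamples of a test set. It is an illustration under stated assumptions, not the study's evaluation code: the 0.5 classification threshold, the percentile interval, the helper names, and the toy labels and probabilities are all hypothetical.

```python
import random

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of cases whose thresholded prediction matches the observed label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)

def f1_score(y_true, y_prob, threshold=0.5):
    """Harmonic mean of precision and recall for the positive class."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, y_true) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, y_true) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc(y_true, y_prob):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [p for p, y in zip(y_prob, y_true) if y == 1]
    neg = [p for p, y in zip(y_prob, y_true) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when only one class is present
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def bootstrap(metric, y_true, y_prob, n_boot=500, seed=0):
    """Mean and 2.5th/97.5th percentiles of a metric over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_prob[i] for i in idx])
        if value == value:  # drop NaN from resamples containing a single class
            values.append(value)
    values.sort()
    lower = values[int(0.025 * len(values))]
    upper = values[min(int(0.975 * len(values)), len(values) - 1)]
    return sum(values) / len(values), lower, upper

# Toy test-set labels and predicted probabilities (hypothetical, not from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

for name, metric in [("AUC", auc), ("accuracy", accuracy),
                     ("Brier", brier_score), ("F1", f1_score)]:
    mean_value, lower, upper = bootstrap(metric, y_true, y_prob)
    print(f"{name}: mean={mean_value:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

Note that AUC and the Brier score use the predicted probabilities directly, whereas accuracy and F1 depend on the chosen threshold.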

Conclusion

These findings can help researchers plan targeted mobilization for scheduled clinical appointments to improve follow-up rates. They will also inform the development of a web-based algorithm to predict attrition risk and support targeted follow-up efforts in similar studies.