Accounting for clustering for self-reported outcomes in the design and analysis of population-based surveys: A case study of estimation of prevalence of epilepsy in Nairobi, Kenya.

IF 1.6

Frontiers in research metrics and analytics. Pub Date: 2025-09-01. eCollection Date: 2025-01-01. DOI: 10.3389/frma.2025.1583476

Daniel M Mwanga, Isaac C Kipchirchir, George O Muhua, Charles R Newton, Damazo T Kadengye



Abstract

Population-based surveys are commonly used to estimate important public health metrics such as prevalence. Survey data often have a hierarchical structure in which households are clustered within villages or sites and interviewers are assigned to specific locations. Self-reported outcomes, such as a diagnosis of epilepsy, present a more complex structure in which interviewer- or physician-related effects may bias the results. Standard estimation techniques that ignore clustering may yield underestimated standard errors and overconfident inferences. In this paper, we examine these effects on the estimation of epilepsy prevalence in a two-stage population-based survey in Nairobi, and we discuss how clustering can be taken into account in the design and analysis of population-based prevalence studies. We used data from the Epilepsy Pathway Innovation in Africa project conducted in Nairobi and simulated attrition levels of 10% and 20% under a missing at random (MAR) mechanism. Attrition was accounted for using the sequential k-nearest neighbor method. We adjusted the expected prevalence for clustering at multiple levels (site, interviewer and household) using a random effects model. An intraclass correlation (ICC) > 0.1 was taken to indicate substantial clustering. We report point estimates with 95% confidence intervals (CIs). The crude prevalence of epilepsy was 9.40 cases per 1,000 people (95% CI: 8.60-10.20). Clustering was substantial at the household level (ICC = 0.397) and the interviewer level (ICC = 0.101); the site-level ICC was 0.070. Prevalence adjusted for clustering at the household, interviewer and site levels was 9.15 per 1,000 (95% CI: 7.11-11.20). Overall, not accounting for clustering was associated with underestimation of standard errors, whereas not accounting for attrition led to underestimation of prevalence. Imputation of the data missing due to attrition mitigated the attrition bias under appropriate assumptions.
Accounting for clustering, particularly at the household, interviewer and site levels, is critical for valid estimation of standard errors in population-based surveys. Rigorous training and pre-survey testing can minimize measurement error in self-reported outcomes. Attrition can lead to underestimation of prevalence if not properly addressed. Attrition bias can be minimized by targeted mobilization of participants to improve response rates and by statistical methods such as multiple imputation or machine learning-based imputation.
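
To make the effect of clustering on precision concrete, the following minimal Python sketch backs out the standard errors implied by the two reported intervals and compares them. It assumes both intervals are symmetric Wald-type intervals, which the abstract does not state. It also evaluates the Kish approximation DEFF = 1 + (m - 1) × ICC, a textbook single-level formula that is not necessarily the authors' method, using the reported household ICC and a purely hypothetical mean household size of 4. The function names and the household size are illustrative assumptions, not values from the paper.

```python
Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def se_from_ci(lower: float, upper: float, z: float = Z_95) -> float:
    """Standard error implied by a symmetric Wald-type confidence interval."""
    return (upper - lower) / (2.0 * z)


def kish_design_effect(icc: float, mean_cluster_size: float) -> float:
    """Kish approximation for one level of clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


# Intervals reported in the abstract, in cases per 1,000 people.
se_crude = se_from_ci(8.60, 10.20)
se_adjusted = se_from_ci(7.11, 11.20)

# How much wider the uncertainty becomes once clustering is modelled.
se_ratio = se_adjusted / se_crude
implied_variance_inflation = se_ratio ** 2

# ICC is from the abstract; the mean household size of 4 is HYPOTHETICAL.
deff_household = kish_design_effect(icc=0.397, mean_cluster_size=4)

print(f"Implied SE, crude:    {se_crude:.3f} per 1,000")
print(f"Implied SE, adjusted: {se_adjusted:.3f} per 1,000")
print(f"SE ratio (adjusted / crude): {se_ratio:.2f}")
print(f"Implied variance inflation:  {implied_variance_inflation:.2f}")
print(f"Kish DEFF, household level (assumed m = 4): {deff_household:.3f}")
```

Under these assumptions the adjusted standard error is roughly 2.6 times the crude one. This is only a rough reading, because the adjusted estimate comes from a random effects model with a slightly different point estimate, so the ratio is not a formal design effect.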
