PARTITIONING AROUND MEDOIDS CLUSTERING AND RANDOM FOREST CLASSIFICATION FOR GIS-INFORMED IMPUTATION OF FLUORIDE CONCENTRATION DATA.

Impact factor: 1.4; CAS zone 4 (Mathematics); JCR Q2 (Statistics & Probability)

Annals of Applied Statistics Pub Date : 2022-03-01 DOI:10.1214/21-aoas1516

Yu Gu, John S Preisser, Donglin Zeng, Poojan Shrestha, Molina Shah, Miguel A Simancas-Pallares, Jeannie Ginnis, Kimon Divaris

Annals of Applied Statistics, vol. 16, no. 1, pp. 551-572. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8963777/pdf/nihms-1731052.pdf

Citations: 2

Abstract

Community water fluoridation is an important component of oral health promotion, as fluoride is a well-documented caries-preventive agent. Direct measurements of domestic water fluoride content provide valuable information regarding individuals' fluoride exposure and thus caries risk; however, they are logistically challenging to carry out at a large scale in oral health research. This article describes the development and evaluation of a novel method for the imputation of missing domestic water fluoride concentration data informed by spatial autocorrelation. The context is a state-wide epidemiologic study of pediatric oral health in North Carolina, where domestic water fluoride concentration information was missing for approximately 75% of study participants with clinical data on dental caries. A new machine-learning-based imputation method that combines partitioning around medoids clustering and random forest classification (PAMRF) is developed and implemented. Imputed values are filtered according to allowable error rates or target sample size, depending on the requirements of each application. In leave-one-out cross-validation and simulation studies, PAMRF outperforms four existing imputation approaches: two conventional spatial interpolation methods (inverse-distance weighting, IDW, and universal kriging, UK) and two supervised learning methods (k-nearest neighbors, KNN, and classification and regression trees, CART). The inclusion of multiply imputed values in the estimation of the association between fluoride concentration and dental caries prevalence resulted in essentially no change in PAMRF estimates but substantial gains in precision due to the larger effective sample size. PAMRF is a powerful new method for the imputation of missing fluoride values where geographical information exists.
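As a rough illustration of the clustering half of PAMRF, the sketch below implements a generic partitioning around medoids (PAM) routine, with a greedy BUILD phase followed by a SWAP phase, on one-dimensional values. It is not the authors' implementation: the abstract does not say which variables are clustered, so applying PAM to measured fluoride concentrations is an assumption made here for illustration, and the concentrations in `fluoride_ppm` are made-up numbers. The random forest classification step is omitted.

```python
from itertools import product


def total_cost(values, medoids):
    """Sum of distances from each value to its nearest medoid."""
    return sum(min(abs(v - values[m]) for m in medoids) for v in values)


def pam(values, k):
    """Partition 1-D values around k medoids (BUILD, then SWAP).

    Returns (medoid_indices, labels), where labels[i] is the position in
    medoid_indices of the medoid closest to values[i].
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(values, medoids + [i]))
        medoids.append(best)

    # SWAP: apply the best medoid/non-medoid exchange while it lowers the cost.
    improved = True
    while improved:
        improved = False
        best_cost, best_swap = total_cost(values, medoids), None
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(values, trial)
            if cost < best_cost:
                best_cost, best_swap = cost, trial
        if best_swap is not None:
            medoids, improved = best_swap, True

    labels = [min(range(k), key=lambda j: abs(v - values[medoids[j]]))
              for v in values]
    return medoids, labels


# Hypothetical concentrations (ppm), invented for this example only.
fluoride_ppm = [0.05, 0.62, 0.08, 0.70, 0.10, 0.74]
medoids, labels = pam(fluoride_ppm, k=2)
print("medoid values:", sorted(fluoride_ppm[m] for m in medoids))
print("cluster labels:", labels)
```

Under this reading, the cluster labels would serve as the classes that a classifier predicts for locations without a measurement. Because a medoid is always an observed value, each cluster's representative concentration is one that was actually measured.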

Source journal

Annals of Applied Statistics (Social Sciences: Statistics & Probability)

CiteScore: 3.10

Self-citation rate: 5.60%

Articles published: 131

Review time: 6-12 weeks

About the journal: Statistical research spans an enormous range from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. The journal is published quarterly in both print and electronic form, and its goal is to provide a timely and unified forum for all areas of applied statistics.