Propensity Score Estimation Using Classification and Regression Trees in the Presence of Missing Covariate Data

JCR: Q3 (Mathematics)

Epidemiologic Methods. Pub Date: 2018-07-25. DOI: 10.1515/em-2017-0020

Bas B L Penning de Vries, M. van Smeden, R. Groenwold

Citations: 8

Propensity Score Estimation Using Classification and Regression Trees in the Presence of Missing Covariate Data

Abstract: Data mining and machine learning techniques such as classification and regression trees (CART) represent a promising alternative to conventional logistic regression for propensity score estimation. Whereas incomplete data preclude the fitting of a logistic regression on all subjects, CART is appealing in part because some implementations allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. Based on theoretical considerations, we argue that the automatic handling of missing data by CART may, however, not be appropriate. Using a series of simulation experiments, we examined the performance of different approaches to handling missing covariate data: (i) applying the CART algorithm directly to the (partially) incomplete data, (ii) complete case analysis, and (iii) multiple imputation. Performance was assessed in terms of bias in estimating exposure-outcome effects among the exposed, standard error, mean squared error, and coverage. Applying the CART algorithm directly to incomplete data resulted in bias, even in scenarios where data were missing completely at random. Overall, multiple imputation followed by CART resulted in the best performance. Our study showed that automatic handling of missing data in CART can cause serious bias and does not outperform multiple imputation as a means to account for missing data.
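The performance measures named in the abstract (bias, standard error, mean squared error, and coverage) could be computed from simulation output roughly as sketched below. This is a minimal illustration and not the authors' code: the function name `summarize_simulation`, the use of Wald-type intervals with z = 1.96, the choice of the empirical standard deviation as the standard error, and all numbers in the usage example are assumptions made here for illustration.

```python
import statistics


def summarize_simulation(estimates, std_errors, true_effect, z=1.96):
    """Summarize effect estimates from repeated simulation runs.

    estimates   : one effect estimate per simulated data set
    std_errors  : the estimated standard error reported in each run
    true_effect : the effect used to generate the data
    z           : normal quantile for the interval (assumed 1.96, i.e. 95%)
    """
    estimates = list(estimates)
    std_errors = list(std_errors)
    n_runs = len(estimates)

    # Bias: mean estimate minus the true effect.
    bias = statistics.fmean(estimates) - true_effect

    # Empirical standard error: spread of the estimates across runs.
    empirical_se = statistics.stdev(estimates)

    # Mean squared error: average squared distance from the true effect.
    mse = statistics.fmean([(e - true_effect) ** 2 for e in estimates])

    # Coverage: share of runs whose interval contains the true effect.
    covered = sum(
        1
        for e, se in zip(estimates, std_errors)
        if e - z * se <= true_effect <= e + z * se
    )

    return {
        "bias": bias,
        "empirical_se": empirical_se,
        "mse": mse,
        "coverage": covered / n_runs,
    }


if __name__ == "__main__":
    # Hypothetical estimates from five simulated data sets (not from the paper).
    demo = summarize_simulation(
        estimates=[0.9, 1.1, 1.3, 0.8, 1.0],
        std_errors=[0.2, 0.2, 0.2, 0.2, 0.2],
        true_effect=1.0,
    )
    for name, value in demo.items():
        print(f"{name}: {value:.3f}")
```

In a comparison like the one the abstract describes, such a summary would be produced once per missing-data approach (direct CART on incomplete data, complete case analysis, multiple imputation) and the resulting rows compared side by side.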

Source journal

Epidemiologic Methods Mathematics-Applied Mathematics

CiteScore

2.10

Self-citation rate

0.00%

Articles published

Journal description: Epidemiologic Methods (EM) seeks contributions comparable to those of the leading epidemiologic journals, but also invites papers that may be more technical or of greater length than what has traditionally been allowed by journals in epidemiology. Applications and examples with real data to illustrate methodology are strongly encouraged but not required. Topics: genetic epidemiology, infectious disease, pharmaco-epidemiology, ecologic studies, environmental exposures, screening, surveillance, social networks, comparative effectiveness, statistical modeling, causal inference, measurement error, study design, meta-analysis.