Jia Li, Mengdie Wang, M. Steinbach, Vipin Kumar, György J. Simon
2018 IEEE International Conference on Big Knowledge (ICBK), published 2018-11-01. DOI: 10.1109/ICBK.2018.00062 (https://doi.org/10.1109/ICBK.2018.00062)
Don't Do Imputation: Dealing with Informative Missing Values in EHR Data Analysis
Missing values pose a significant challenge in data analytics, especially in clinical studies, where data are typically missing-not-at-random (MNAR). Applying techniques designed for missing-at-random (MAR) data, such as imputation, to MNAR data can lead to bias. In this work, we propose pattern-wise analysis, a collection of methods for building predictive models in the presence of MNAR missing values. This methodology constructs an individual model for each missingness pattern. We show that even the simplest pattern-wise method, Per-Pattern Modeling (PPM), outperforms models built on data sets completed by the most popular imputation methods. PPM faces difficulty when the number of missingness patterns is too high or when individual missingness patterns have too few observations. We developed variants of PPM to overcome these challenges from three complementary perspectives: (i) a model selection perspective, where PPM can select the patterns on which to build models; (ii) a distributional perspective, where the training data set is expanded in a distribution-preserving fashion; and (iii) a causal perspective, where a causal structure for the MNAR mechanism is assumed and exploited to convert the problem from MNAR to MAR. Evaluation of the proposed methods on both synthetic MNAR data and a real-world clinical data set of sepsis patients shows notable improvement over traditional approaches.
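The core idea of building one model per missingness pattern can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`missingness_pattern`, `fit_per_pattern`, `predict_per_pattern`), the use of NaN to mark missing values, the choice of an ordinary least-squares linear model, and the toy data are all assumptions made for illustration. Each model is trained only on the rows that share a pattern, using only the features observed under that pattern, so no value is imputed.

```python
import numpy as np


def missingness_pattern(row):
    """Return a hashable pattern: True where the feature is missing (NaN)."""
    return tuple(bool(m) for m in np.isnan(row))


def fit_per_pattern(X, y):
    """Fit one least-squares linear model per missingness pattern.

    Each model sees only the rows with that pattern and only the
    features observed under it. The linear learner is an assumption;
    any supervised learner could be substituted.
    """
    patterns = [missingness_pattern(row) for row in X]
    models = {}
    for pattern in set(patterns):
        rows = np.array([p == pattern for p in patterns])
        observed = ~np.array(pattern)
        # Design matrix: intercept column plus the observed features.
        A = np.column_stack([np.ones(rows.sum()), X[rows][:, observed]])
        coef, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
        models[pattern] = coef
    return models


def predict_per_pattern(models, X):
    """Route each row to the model trained for its missingness pattern.

    Raises KeyError if a row has a pattern that was absent in training.
    """
    preds = np.empty(len(X))
    for i, row in enumerate(X):
        pattern = missingness_pattern(row)
        observed = ~np.array(pattern)
        coef = models[pattern]
        preds[i] = coef[0] + row[observed] @ coef[1:]
    return preds


# Toy data (hypothetical): the target follows a different relationship
# when the second feature is missing, i.e. missingness is informative.
nan = np.nan
X_train = np.array([
    [1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 0.0],  # fully observed
    [1.0, nan], [2.0, nan], [3.0, nan],              # second feature missing
])
y_train = np.array([5.0, 4.0, 13.0, 4.0, 9.0, 8.0, 7.0])

models = fit_per_pattern(X_train, y_train)
X_new = np.array([[2.0, 3.0], [5.0, nan]])
print(predict_per_pattern(models, X_new))
```

In this sketch the first new row is routed to the complete-case model and the second to the model for the pattern in which the second feature is missing. The `KeyError` on an unseen pattern, and the poorly determined fit when a pattern has few rows, correspond to the two difficulties the abstract attributes to PPM.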