{"title":"ELMV","authors":"L. J. Liu, Hongwei Zhang, Jianzhong Di, Jin Chen","doi":"10.1145/3388440.3412431","DOIUrl":null,"url":null,"abstract":"Many real-world Electronic Health Record (EHR) data contain a large proportion of missing values. Leaving a substantial portion of missing information unaddressed usually causes significant bias, leading to invalid conclusions to be drawn. On the other hand, training a machine learning model with a much smaller nearly-complete subset can drastically impact the reliability and accuracy of model inference. Data imputation algorithms that attempt to replace missing data with meaningful values, inevitably increase the variability of effect estimates with increased missingness, making it unreliable for hypothesis validation. We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, an effective approach to construct multiple subsets with much lower missing rates of the original EHR data as well as to mobilize dedicated support data for ensemble learning, for the purpose of reducing the bias caused by substantial missing values. ELMV has been evaluated on real-world healthcare data for critical feature identification and simulation data with different missing rates for outcome prediction. In both experiments, ELMV outperforms conventional missing value imputation methods and traditional ensemble learning models. The source code of ELMV is available at https://github.com/lucasliu0928/ELMV.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"141 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3388440.3412431","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Many real-world Electronic Health Record (EHR) datasets contain a large proportion of missing values. Leaving a substantial portion of the missing information unaddressed usually introduces significant bias, which leads to invalid conclusions. On the other hand, training a machine learning model on a much smaller, nearly complete subset can drastically impact the reliability and accuracy of model inference. Data imputation algorithms, which attempt to replace missing data with meaningful values, inevitably increase the variability of effect estimates as missingness grows, making them unreliable for hypothesis validation. We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, an effective approach that constructs multiple subsets of the original EHR data with much lower missing rates and mobilizes dedicated support data for ensemble learning, with the goal of reducing the bias caused by substantial missing values. ELMV has been evaluated on real-world healthcare data for critical feature identification and on simulation data with different missing rates for outcome prediction. In both experiments, ELMV outperforms conventional missing value imputation methods and traditional ensemble learning models. The source code of ELMV is available at https://github.com/lucasliu0928/ELMV.
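The abstract only outlines the idea of training an ensemble over multiple nearly complete subsets of the original data instead of imputing missing values. The sketch below is a rough, hypothetical illustration of that idea in Python; the subset-construction rule (sample column subsets and keep only rows fully observed on them), the use of a random forest as the base learner, and the averaging rule for prediction are all assumptions for illustration, not the authors' actual ELMV procedure, which also involves dedicated support data not modeled here.

```python
# Hypothetical sketch: ensemble learning over nearly complete subsets of an
# incomplete table, as a stand-in for the ELMV idea described in the abstract.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def nearly_complete_subsets(df, feature_cols, n_subsets=5, n_cols=4, seed=0):
    """Sample column subsets and keep only the rows that are complete on them
    (an assumed selection heuristic, not the paper's algorithm)."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_subsets):
        cols = list(rng.choice(feature_cols, size=min(n_cols, len(feature_cols)),
                               replace=False))
        sub = df[cols + ["outcome"]].dropna()  # rows fully observed on these columns
        if len(sub) > 0:
            subsets.append((cols, sub))
    return subsets


def fit_ensemble(df, feature_cols):
    """Fit one base classifier per nearly complete subset."""
    members = []
    for cols, sub in nearly_complete_subsets(df, feature_cols):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(sub[cols], sub["outcome"])
        members.append((cols, clf))
    return members


def predict_ensemble(members, x_row):
    """Average predictions from the members whose features are observed
    for this record; skip members that would require imputation."""
    votes = []
    for cols, clf in members:
        if x_row[cols].notna().all():
            x = x_row[cols].to_frame().T.astype(float)
            votes.append(clf.predict_proba(x)[0, 1])
    return float(np.mean(votes)) if votes else np.nan
```

Under this reading, each ensemble member sees only complete data, so no imputed values enter training, and a new record is scored only by the members whose features it actually has observed.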