Four Factors Affecting Missing Data Imputation
A. Hackl, Jürgen Zeindl, Lisa Ehrlinger
Proceedings of the 35th International Conference on Scientific and Statistical Database Management, published 2023-07-10
DOI: 10.1145/3603719.3604285 (https://doi.org/10.1145/3603719.3604285)

Missing data is a common problem in datasets and impacts the reliability of data analysis. Numerous methods to impute (i.e., predict and replace) missing values have been proposed. The quality of these imputed values depends on factors like correlation, percentage of missingness, or the mechanism behind the missing value. Despite comparative studies on imputation methods, conditions for their effectiveness and safe application lack dedicated investigation. This research aims to systematically investigate the impact of four factors on imputation quality. We specifically investigate the extent to which (1) missing data mechanism, (2) variable distribution, (3) correlation, and (4) percentage of missingness affect the imputation quality of eight different machine-learning-based imputation methods. The evaluation will be done on both a synthetic dataset and a real-world dataset from voestalpine Stahl GmbH.
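
To make the kind of experiment described above concrete, the following is a minimal sketch, not the authors' setup. It assumes a synthetic two-variable dataset, covers only the MCAR (missing completely at random) mechanism, and compares two deliberately simple imputers (mean and linear regression) rather than the eight machine-learning-based methods the abstract mentions. RMSE as the quality measure is also an assumption, as are all function names, which are hypothetical. It varies two of the four factors, correlation and percentage of missingness.

```python
import math
import random


def make_correlated_pairs(n, rho, rng):
    """Draw n (x, y) pairs of standard normals with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * noise
        pairs.append((x, y))
    return pairs


def mcar_mask(n, missing_rate, rng):
    """MCAR: each y value is hidden independently with the same probability."""
    return [rng.random() < missing_rate for _ in range(n)]


def rmse(truth, estimate):
    """Root mean squared error between true and imputed values."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def evaluate(rho, missing_rate, n=2000, seed=0):
    """Return (rmse_mean_imputation, rmse_regression_imputation) for y."""
    rng = random.Random(seed)
    pairs = make_correlated_pairs(n, rho, rng)
    mask = mcar_mask(n, missing_rate, rng)

    observed = [p for p, hidden in zip(pairs, mask) if not hidden]
    missing = [p for p, hidden in zip(pairs, mask) if hidden]
    if not observed or not missing:
        raise ValueError("need at least one observed and one missing row")

    # Fit both imputers on the rows where y is observed.
    mean_x = sum(x for x, _ in observed) / len(observed)
    mean_y = sum(y for _, y in observed) / len(observed)
    sxx = sum((x - mean_x) ** 2 for x, _ in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Impute the hidden y values and score them against the ground truth.
    truth = [y for _, y in missing]
    mean_imputed = [mean_y] * len(missing)
    regression_imputed = [intercept + slope * x for x, _ in missing]
    return rmse(truth, mean_imputed), rmse(truth, regression_imputed)


if __name__ == "__main__":
    for rho in (0.1, 0.9):
        for missing_rate in (0.1, 0.5):
            mean_err, reg_err = evaluate(rho, missing_rate)
            print(f"rho={rho} missing={missing_rate:.0%} "
                  f"mean RMSE={mean_err:.3f} regression RMSE={reg_err:.3f}")
```

Extending this sketch to the other two factors would mean replacing `mcar_mask` with a mask that depends on `x` (MAR) or on `y` itself (MNAR), and replacing the Gaussian draws in `make_correlated_pairs` with a different variable distribution.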