{"title":"缺失值混合数据高斯Copula因果结构的鲁棒估计","authors":"Ruifei Cui, P. Groot, T. Heskes","doi":"10.1109/ICDM.2017.101","DOIUrl":null,"url":null,"abstract":"We consider the problem of causal structure learning from data with missing values, assumed to be drawn from a Gaussian copula model. First, we extend the 'Rank PC' algorithm, designed for Gaussian copula models with purely continuous data (so-called nonparanormal models), to incomplete data by applying rank correlation to pairwise complete observations and replacing the sample size with an effective sample size in the conditional independence tests to account for the information loss from missing values. The resulting approach works when the data are missing completely at random (MCAR). Then, we propose a Gibbs sampling procedure to draw correlation matrix samples from mixed data under missingness at random (MAR). These samples are translated into an average correlation matrix, and an effective sample size, resulting in the 'Copula PC' algorithm for incomplete data. Simulation study shows that: 1) the usage of the effective sample size significantly improves the performance of 'Rank PC' and 'Copula PC'; 2) 'Copula PC' estimates a more accurate correlation matrix and causal structure than 'Rank PC' under MCAR and, even more so, under MAR. Also, we illustrate our methods on a real-world data set about gene expression.","PeriodicalId":254086,"journal":{"name":"2017 IEEE International Conference on Data Mining (ICDM)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Robust Estimation of Gaussian Copula Causal Structure from Mixed Data with Missing Values\",\"authors\":\"Ruifei Cui, P. Groot, T. Heskes\",\"doi\":\"10.1109/ICDM.2017.101\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider the problem of causal structure learning from data with missing values, assumed to be drawn from a Gaussian copula model. First, we extend the 'Rank PC' algorithm, designed for Gaussian copula models with purely continuous data (so-called nonparanormal models), to incomplete data by applying rank correlation to pairwise complete observations and replacing the sample size with an effective sample size in the conditional independence tests to account for the information loss from missing values. The resulting approach works when the data are missing completely at random (MCAR). Then, we propose a Gibbs sampling procedure to draw correlation matrix samples from mixed data under missingness at random (MAR). These samples are translated into an average correlation matrix, and an effective sample size, resulting in the 'Copula PC' algorithm for incomplete data. Simulation study shows that: 1) the usage of the effective sample size significantly improves the performance of 'Rank PC' and 'Copula PC'; 2) 'Copula PC' estimates a more accurate correlation matrix and causal structure than 'Rank PC' under MCAR and, even more so, under MAR. Also, we illustrate our methods on a real-world data set about gene expression.\",\"PeriodicalId\":254086,\"journal\":{\"name\":\"2017 IEEE International Conference on Data Mining (ICDM)\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE International Conference on Data Mining (ICDM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDM.2017.101\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2017.101","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Robust Estimation of Gaussian Copula Causal Structure from Mixed Data with Missing Values
We consider the problem of causal structure learning from data with missing values, assumed to be drawn from a Gaussian copula model. First, we extend the 'Rank PC' algorithm, designed for Gaussian copula models with purely continuous data (so-called nonparanormal models), to incomplete data by applying rank correlation to pairwise complete observations and replacing the sample size with an effective sample size in the conditional independence tests to account for the information loss from missing values. The resulting approach works when the data are missing completely at random (MCAR). Then, we propose a Gibbs sampling procedure to draw correlation matrix samples from mixed data under missingness at random (MAR). These samples are translated into an average correlation matrix, and an effective sample size, resulting in the 'Copula PC' algorithm for incomplete data. Simulation study shows that: 1) the usage of the effective sample size significantly improves the performance of 'Rank PC' and 'Copula PC'; 2) 'Copula PC' estimates a more accurate correlation matrix and causal structure than 'Rank PC' under MCAR and, even more so, under MAR. Also, we illustrate our methods on a real-world data set about gene expression.