The Pairwise Gaussian Random Field for High-Dimensional Data Imputation

2013 IEEE 13th International Conference on Data Mining Pub Date : 2013-12-01 DOI:10.1109/ICDM.2013.149

Zhuhua Cai, C. Jermaine, Zografoula Vagena, Dionysios Logothetis, L. Perez

Citations: 7

Abstract

In this paper, we consider the problem of imputation (recovering missing values) in very high-dimensional data with an arbitrary covariance structure. The modern solution to this problem is the Gaussian Markov random field (GMRF). The difficulty in applying a GMRF to very high-dimensional data imputation is that, while the GMRF model itself can be useful even for data having tens of thousands of dimensions, using a GMRF requires access to a sparsified inverse covariance matrix for the data. Computing this matrix with even state-of-the-art methods is very costly, as it typically requires first estimating the covariance matrix from the data (at an O(nm^2) cost for m dimensions and n data points) and then performing a regularized inversion of the estimated covariance matrix, which is also very expensive. This is impractical for even moderately sized, high-dimensional data sets. In this paper, we propose a very simple alternative to the GMRF called the pairwise Gaussian random field, or PGRF for short. The PGRF is a graphical, factor-based model. Unlike traditional Gaussian or GMRF models, a PGRF does not require a covariance or correlation matrix as input. Instead, a PGRF takes as input a set of p (dimension, dimension) pairs for which the user suspects there might be a strong correlation or anti-correlation. This set of pairs defines the graphical structure of the model, with a simple Gaussian factor associated with each of the p (dimension, dimension) pairs. Using this structure, it is easy to perform simultaneous inference and imputation of the model. The key benefit of the approach is that the time required for the PGRF to perform inference is approximately linear with respect to p, where p will typically be much smaller than the number of entries in an m×m covariance or precision matrix.
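The abstract does not spell out the PGRF inference procedure, so the following is only a rough illustrative sketch of the general idea of pair-based imputation, not the authors' algorithm. It assumes one bivariate Gaussian is fitted per user-supplied (dimension, dimension) pair, and that a missing value is filled with a precision-weighted average of the conditional means suggested by its paired dimensions, refined over a few sweeps. The function names `fit_pair_factors` and `impute`, the sweep count, and the column-mean initialization are all hypothetical choices made for this example.

```python
import numpy as np

def fit_pair_factors(X, pairs):
    """Fit one bivariate Gaussian per (i, j) pair, using rows where both are observed."""
    factors = {}
    for i, j in pairs:
        both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
        xi, xj = X[both, i], X[both, j]
        mean = np.array([xi.mean(), xj.mean()])
        cov = np.cov(np.vstack([xi, xj]))  # 2x2 sample covariance
        factors[(i, j)] = (mean, cov)
    return factors

def impute(X, factors, sweeps=10):
    """Fill NaNs by combining per-pair conditional means (heuristic sketch)."""
    X = X.copy()
    missing = np.isnan(X)
    rows, cols = np.nonzero(missing)
    # Start every missing cell at its column mean.
    X[missing] = np.nanmean(X, axis=0)[cols]

    # For each dimension, list the regression of it on each paired dimension:
    # (other dimension, own mean, other mean, slope, conditional variance).
    neighbours = {}
    for (i, j), (mean, cov) in factors.items():
        for tgt, oth, a, b in ((i, j, 0, 1), (j, i, 1, 0)):
            slope = cov[a, b] / cov[b, b]
            cond_var = max(cov[a, a] - cov[a, b] ** 2 / cov[b, b], 1e-9)
            neighbours.setdefault(tgt, []).append(
                (oth, mean[a], mean[b], slope, cond_var))

    # Each sweep touches every (missing cell, paired dimension) once, so the
    # work grows with the number of pairs rather than with m * m.
    for _ in range(sweeps):
        for r, c in zip(rows, cols):
            prec_sum, weighted = 0.0, 0.0
            for oth, m_c, m_o, slope, cond_var in neighbours.get(int(c), []):
                pred = m_c + slope * (X[r, oth] - m_o)  # conditional mean
                prec_sum += 1.0 / cond_var
                weighted += pred / cond_var
            if prec_sum > 0.0:
                X[r, c] = weighted / prec_sum
    return X
```

A call such as `impute(X, fit_pair_factors(X, [(0, 1), (1, 5)]))` would leave observed entries untouched and fill only the NaN cells. Dimensions that appear in no pair simply keep their column mean, which reflects the trade-off the abstract describes: the model only knows about correlations the user chose to list.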
