The Pairwise Gaussian Random Field for High-Dimensional Data Imputation

Zhuhua Cai, C. Jermaine, Zografoula Vagena, Dionysios Logothetis, L. Perez
{"title":"The Pairwise Gaussian Random Field for High-Dimensional Data Imputation","authors":"Zhuhua Cai, C. Jermaine, Zografoula Vagena, Dionysios Logothetis, L. Perez","doi":"10.1109/ICDM.2013.149","DOIUrl":null,"url":null,"abstract":"In this paper, we consider the problem of imputation (recovering missing values) in very high-dimensional data with an arbitrary covariance structure. The modern solution to this problem is the Gaussian Markov random field (GMRF). The problem with applying a GMRF to very high-dimensional data imputation is that while the GMRF model itself can be useful even for data having tens of thousands of dimensions, utilizing a GMRF requires access to a sparsified, inverse covariance matrix for the data. Computing this matrix using even state-of-the-art methods is very costly, as it typically requires first estimating the covariance matrix from the data (at a O(nm2) cost for m dimensions and n data points) and then performing a regularized inversion of the estimated covariance matrix, which is also very expensive. This is impractical for even moderately-sized, high-dimensional data sets. In this paper, we propose a very simple alternative to the GMRF called the pair wise Gaussian random field or PGRF for short. The PGRF is a graphical, factor-based model. Unlike traditional Gaussian or GMRF models, a PGRF does not require a covariance or correlation matrix as input. Instead, a PGRF takes as input a set of p (dimension, dimension) pairs for which the user suspects there might be a strong correlation or anti-correlation. This set of pairs defines the graphical structure of the model, with a simple Gaussian factor associated with each of the p (dimension, dimension) pairs. Using this structure, it is easy to perform simultaneous inference and imputation of the model. The key benefit of the approach is that the time required for the PGRF to perform inference is approximately linear with respect to p, where p will typically be much smaller than the number of entries in a m×m covariance or precision matrix.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 13th International Conference on Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2013.149","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

In this paper, we consider the problem of imputation (recovering missing values) in very high-dimensional data with an arbitrary covariance structure. The modern solution to this problem is the Gaussian Markov random field (GMRF). The problem with applying a GMRF to very high-dimensional data imputation is that while the GMRF model itself can be useful even for data having tens of thousands of dimensions, utilizing a GMRF requires access to a sparsified, inverse covariance matrix for the data. Computing this matrix using even state-of-the-art methods is very costly, as it typically requires first estimating the covariance matrix from the data (at a O(nm2) cost for m dimensions and n data points) and then performing a regularized inversion of the estimated covariance matrix, which is also very expensive. This is impractical for even moderately-sized, high-dimensional data sets. In this paper, we propose a very simple alternative to the GMRF called the pair wise Gaussian random field or PGRF for short. The PGRF is a graphical, factor-based model. Unlike traditional Gaussian or GMRF models, a PGRF does not require a covariance or correlation matrix as input. Instead, a PGRF takes as input a set of p (dimension, dimension) pairs for which the user suspects there might be a strong correlation or anti-correlation. This set of pairs defines the graphical structure of the model, with a simple Gaussian factor associated with each of the p (dimension, dimension) pairs. Using this structure, it is easy to perform simultaneous inference and imputation of the model. The key benefit of the approach is that the time required for the PGRF to perform inference is approximately linear with respect to p, where p will typically be much smaller than the number of entries in a m×m covariance or precision matrix.
高维数据输入的成对高斯随机场
本文研究了具有任意协方差结构的高维数据的插值问题(缺失值恢复问题)。这个问题的现代解决方案是高斯马尔可夫随机场(GMRF)。将GMRF应用于非常高维的数据输入的问题在于,尽管GMRF模型本身对于具有数万维的数据也很有用,但利用GMRF需要访问数据的稀疏化逆协方差矩阵。即使使用最先进的方法计算这个矩阵也是非常昂贵的,因为它通常需要首先从数据中估计协方差矩阵(对于m个维度和n个数据点,成本为O(nm2)),然后对估计的协方差矩阵执行正则化反演,这也是非常昂贵的。即使对于中等大小的高维数据集,这也是不切实际的。在本文中,我们提出了一个非常简单的替代GMRF,称为对高斯随机场或简称PGRF。PGRF是一个图形化的、基于因素的模型。与传统的高斯或GMRF模型不同,PGRF不需要协方差或相关矩阵作为输入。相反,PGRF将一组p(维度,维度)对作为输入,用户怀疑其中可能存在强相关性或反相关性。这组对定义了模型的图形结构,每个p (dimension, dimension)对都有一个简单的高斯因子。利用这种结构,可以很容易地同时进行模型的推理和imputation。该方法的主要优点是,PGRF执行推理所需的时间与p近似线性,其中p通常比m×m协方差或精度矩阵中的条目数小得多。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信