Missing data imputation with high-dimensional data

The American Statistician Pub Date: 2023-10-02 DOI: 10.1080/00031305.2023.2259962

Alberto Brini, Edwin R. van den Heuvel


Citations: 0



Missing data imputation with high-dimensional data

Abstract

Imputation of missing data in high-dimensional datasets with more variables P than samples N, P≫N, is hampered by the data dimensionality. For multivariate imputation, the covariance matrix is ill-conditioned and cannot be properly estimated. For fully conditional imputation, the regression models for imputation cannot include all the variables. Thus, the high dimension requires special imputation approaches. In this paper, we provide an overview and realistic comparisons of imputation approaches for high-dimensional data when applied to a linear mixed modelling (LMM) framework. We examine approaches from three different classes using simulation studies: multiple imputation with penalized regression, multiple imputation with recursive partitioning and predictive mean matching, and multiple imputation with Principal Component Analysis (PCA). We illustrate the methods on a real case study where a multivariate outcome, i.e., an extracted set of correlated biomarkers from human urine samples, was collected and monitored over time, and we discuss the proposed methods with more standard imputation techniques that could be applied by ignoring either the multivariate or the longitudinal dimension. Our simulations demonstrate the superiority of the recursive partitioning and predictive mean matching algorithm over the other methods in terms of bias, mean squared error and coverage of the LMM parameter estimates when compared to those obtained from a data analysis without missingness, although it comes at the expense of high computational costs. It is worthwhile reconsidering much faster methodologies like the one relying on PCA.

Keywords: high-dimensional data; longitudinal data; linear mixed models; missing data; multiple imputation; principal component analysis; penalized regression; recursive partitioning

Disclaimer: As a service to authors and researchers we are providing this version of an accepted manuscript (AM). Copyediting, typesetting, and review of the resulting proofs will be undertaken on this manuscript before final publication of the Version of Record (VoR). During production and pre-press, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal relate to these versions also.
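
The abstract's remark that the covariance matrix cannot be properly estimated when P≫N can be illustrated with a small numerical sketch. The code below is not taken from the paper; it is a minimal illustration, assuming simulated Gaussian data and hypothetical helper names (`sample_covariance`, `matrix_rank`), of the standard fact that a sample covariance matrix built from N observations has rank at most N − 1, so it is singular whenever P ≥ N.

```python
import random


def sample_covariance(rows):
    """Unbiased sample covariance (P x P) of N rows with P columns each."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            continue  # column is numerically zero below the current row
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank


# Illustrative sizes only (assumed, not from the paper): more variables than samples.
random.seed(0)
N, P = 5, 8
data = [[random.gauss(0.0, 1.0) for _ in range(P)] for _ in range(N)]

cov = sample_covariance(data)
rank = matrix_rank(cov)
print(f"P = {P}, N = {N}, rank of sample covariance = {rank}")
```

Because the P × P matrix is rank-deficient, it cannot be inverted, which is one way to see why joint (multivariate) imputation models need regularization or dimension reduction in this setting.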


Source journal

The American Statistician

Self-citation rate

0.00%

Articles published