Principal Component Approximation and Interpretation in Health Survey and Biobank Data

Frontiers Digit. Humanit. Pub Date: 2018-06-26 DOI: 10.3389/fdigh.2018.00011

Y. Chao, Hsing-Chien Wu, Chao-Jung Wu, Wei-Chih Chen


Citations: 11

Abstract



Background: The number of variables in surveys and administrative databases keeps growing. Principal component analysis (PCA) is widely used to summarize data or reduce dimensionality. However, one disadvantage of PCA is the limited interpretability of the principal components (PCs), especially in high-dimensional databases. By analyzing the variance distribution according to PCA loadings and approximating PCs with input variables, we aim to demonstrate the importance of variables based on the proportions of total variance contributed or explained by input variables.

Methods: Five data sets of various sizes were used to assess the performance of PC approximation: Hitters; the SF-12v2 subset of the 2004 to 2011 Medical Expenditure Panel Survey (MEPS); the full set of 1996 to 2011 MEPS data; and two data sets derived from the Canadian Health Measures Survey (CHMS), namely a spirometry subset with the measures from the first spirometry trial and a full data set containing non-redundant variables. The variables in each data set were centered and scaled before PCA. PC approximation was studied with two approaches. First, the PCA loadings were squared to estimate the variance contributed by each variable to the PCs. Second, forward stepwise regression was used to approximate the PCs with all input variables.

Results: The first few PCs had large variances in each data set. Approximating PCs with stepwise regression identified the input variables that explain large portions of PC variance more efficiently than approximating according to PCA loadings. Stepwise regression required fewer variables to explain more than 80% of the PC variances.

Conclusion: Approximating and interpreting PCs with stepwise regression is highly feasible. PC approximation is useful to 1) interpret PCs with input variables, 2) understand the major sources of variance in data sets, 3) select unique sources of information and 4) search and rank input variables according to the proportions of PC variance explained. This can be an approach to systematically understand databases and search for the variables that are important to them.
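The two approaches compared in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`standardize`, `pca`, `loading_select`, `forward_select`), the synthetic data, and the use of NumPy are all assumptions made for illustration. The abstract does not state which entry criterion the stepwise procedure used, so the sketch assumes a greedy rule that adds whichever variable raises R² the most. The 0.80 target mirrors the "more than 80%" threshold mentioned in the Results.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit sample variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(Z):
    """PCA of standardized data via SVD: returns scores, loadings, PC variances."""
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt.T                      # each column is a unit-length loading vector
    scores = Z @ loadings                # PC scores, one column per PC
    variances = s**2 / (Z.shape[0] - 1)  # variance of each PC
    return scores, loadings, variances

def loading_select(loading_vector, target=0.80):
    """Approach 1: rank variables by squared loading and keep the top ones
    until their cumulative share of the PC reaches `target`."""
    sq = loading_vector**2               # sums to 1 for a unit-length vector
    order = np.argsort(sq)[::-1]
    cum = np.cumsum(sq[order])
    n = min(int(np.searchsorted(cum, target)) + 1, len(order))
    return order[:n].tolist(), float(cum[n - 1])

def r_squared(Z_sub, y):
    """R^2 of a least-squares fit of y on the columns of Z_sub (all centered)."""
    coef, *_ = np.linalg.lstsq(Z_sub, y, rcond=None)
    resid = y - Z_sub @ coef
    return 1.0 - float(resid @ resid) / float(y @ y)

def forward_select(Z, y, target=0.80):
    """Approach 2 (assumed greedy criterion): repeatedly add the variable that
    raises R^2 the most, stopping once R^2 reaches `target`."""
    chosen, r2 = [], 0.0
    remaining = list(range(Z.shape[1]))
    while remaining and r2 < target:
        r2, best = max((r_squared(Z[:, chosen + [j]], y), j) for j in remaining)
        chosen.append(best)
        remaining.remove(best)
    return chosen, r2

# Synthetic stand-in for a survey table: 200 rows, 6 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(200, 6))

Z = standardize(X)
scores, loadings, variances = pca(Z)

by_loading, share = loading_select(loadings[:, 0])
chosen, r2 = forward_select(Z, scores[:, 0])

print("PC variances:", np.round(variances, 3))
print("Squared-loading ranking for PC1:", by_loading, "cumulative share:", round(share, 3))
print("Forward-selected variables for PC1:", chosen, "R^2:", round(r2, 3))
```

The two numbers are not the same quantity: the squared-loading share describes how the unit-length loading vector is distributed across variables, whereas the regression R² measures how much of the PC's variance a subset of variables actually reproduces. Correlated inputs can therefore reach a high R² with fewer variables than the loading ranking suggests, which is consistent with the comparison reported in the Results.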

