Joshua C. Macdonald, Javier Blanco-Portillo, Marcus W. Feldman, Yoav Ram
arXiv - STAT - Applications, 2024-09-18. DOI: arxiv-2409.12129 (https://doi.org/arxiv-2409.12129)
Bayesian estimation of the number of significant principal components for cultural data
Principal component analysis (PCA) is often used together with cluster analysis
to analyze multivariate data, and the clustering results depend on the number of
principal components used. It is therefore important to determine the number of
significant principal components (PCs) extracted from a data set. Here we use a
variational Bayesian version of classical PCA to develop a new method for
estimating the number of significant PCs in contexts where the number of
samples is similar to or greater than the number of features. This
eliminates guesswork and potential bias in manually determining the number of
principal components and avoids overestimation of variance by filtering noise.
This framework can be applied to datasets of different shapes (number of rows
and columns), different data types (binary, ordinal, categorical, continuous),
and with noisy and missing data. Therefore, it is especially useful for data
with arbitrary encodings and similar numbers of rows and columns, such as
cultural, ecological, morphological, and behavioral datasets. We tested our
method on both synthetic data and empirical datasets and found that it may
underestimate but not overestimate the number of principal components for the
synthetic data. A small number of components was found for each empirical
dataset. These results suggest that the method is broadly applicable across the
life sciences.
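
The idea of letting a Bayesian prior switch off unsupported components can be illustrated with a minimal sketch. It is not the authors' implementation: in place of their variational Bayesian treatment it uses a simpler MAP-EM scheme for probabilistic PCA with an automatic relevance determination (ARD) prior on each loading column, in the spirit of Bishop's Bayesian PCA. The function names (fit_ard_ppca, count_components), the pruning tolerance, the floor on the column norms, and the synthetic data are all assumptions made for illustration, and the sketch assumes more samples than features.

```python
import numpy as np


def fit_ard_ppca(T, q_max=None, n_iter=300, floor=1e-10):
    """MAP-EM for probabilistic PCA with an ARD prior on each column of W.

    Illustrative sketch only. Returns the loading matrix W (d x q_max)
    and the estimated isotropic noise variance sigma2.
    """
    N, d = T.shape
    q = d - 1 if q_max is None else q_max
    Tc = T - T.mean(axis=0)                 # center the features
    S = Tc.T @ Tc / N                       # sample covariance (d x d)

    # Initialise W from the leading eigenvectors of S (classical PCA).
    evals, evecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:q]
    W = evecs[:, top] * np.sqrt(evals[top])
    sigma2 = float(evals.min())
    alpha = d / np.maximum((W ** 2).sum(axis=0), floor)

    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables, averaged over samples.
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))
        SW = S @ W
        S_tx = SW @ M_inv                                  # (1/N) sum_n t_n <x_n>^T
        S_xx = sigma2 * M_inv + M_inv @ W.T @ SW @ M_inv   # (1/N) sum_n <x_n x_n^T>

        # M-step: the ARD precisions alpha shrink weakly supported columns.
        W_new = S_tx @ np.linalg.inv(S_xx + (sigma2 / N) * np.diag(alpha))
        sigma2 = float(
            np.trace(S) - 2.0 * np.trace(W_new.T @ S_tx)
            + np.trace(S_xx @ W_new.T @ W_new)
        ) / d
        W = W_new
        # Re-estimate each precision; the floor avoids division by zero
        # once a column has collapsed.
        alpha = d / np.maximum((W ** 2).sum(axis=0), floor)

    return W, sigma2


def count_components(W, rel_tol=1e-6):
    """Count columns of W whose squared norm has not collapsed toward zero."""
    norms = (W ** 2).sum(axis=0)
    return int((norms > rel_tol * norms.max()).sum())


# Synthetic check: k_true strong directions plus unit-variance isotropic noise.
rng = np.random.default_rng(0)
N, d, k_true = 500, 15, 10
basis, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis
scales = np.linspace(8.0, 3.0, k_true)                 # signal standard deviations
latent = rng.standard_normal((N, k_true)) * scales
T = latent @ basis[:, :k_true].T + rng.standard_normal((N, d))

W, sigma2 = fit_ard_ppca(T)
k_hat = count_components(W)
print(f"estimated components: {k_hat} (true: {k_true}), noise variance: {sigma2:.3f}")
```

The sketch starts with d - 1 candidate components and lets the prior decide which to keep, so no cutoff on explained variance has to be chosen by hand. The synthetic setting here is deliberately well separated; with weaker signals or a larger share of noise dimensions, a scheme like this can retain a different number of columns than the true rank.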