Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA.

IF 3.6 2区生物学 Q1 BIOCHEMICAL RESEARCH METHODS

PLoS Computational Biology Pub Date : 2025-02-07 eCollection Date: 2025-02-01 DOI:10.1371/journal.pcbi.1012747

Eliezyer Fermino de Oliveira, Pranjal Garg, Jens Hjerling-Leffler, Renata Batista-Brito, Lucas Sjulson

{"title":"Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA.","authors":"Eliezyer Fermino de Oliveira, Pranjal Garg, Jens Hjerling-Leffler, Renata Batista-Brito, Lucas Sjulson","doi":"10.1371/journal.pcbi.1012747","DOIUrl":null,"url":null,"abstract":"<p><p>High-dimensional data have become ubiquitous in the biological sciences, and it is often desirable to compare two datasets collected under different experimental conditions to extract low-dimensional patterns enriched in one condition. However, traditional dimensionality reduction techniques cannot accomplish this because they operate on only one dataset. Contrastive principal component analysis (cPCA) has been proposed to address this problem, but it has seen little adoption because it requires tuning a hyperparameter resulting in multiple solutions, with no way of knowing which is correct. Moreover, cPCA uses foreground and background conditions that are treated differently, making it ill-suited to compare two experimental conditions symmetrically. Here we describe the development of generalized contrastive PCA (gcPCA), a flexible hyperparameter-free approach that solves these problems. We first provide analyses explaining why cPCA requires a hyperparameter and how gcPCA avoids this requirement. We then describe an open-source gcPCA toolbox containing Python and MATLAB implementations of several variants of gcPCA tailored for different scenarios. Finally, we demonstrate the utility of gcPCA in analyzing diverse high-dimensional biological data, revealing unsupervised detection of hippocampal replay in neurophysiological recordings and heterogeneity of type II diabetes in single-cell RNA sequencing data. As a fast, robust, and easy-to-use comparison method, gcPCA provides a valuable resource facilitating the analysis of diverse high-dimensional datasets to gain new insights into complex biological phenomena.</p>","PeriodicalId":20241,"journal":{"name":"PLoS Computational Biology","volume":"21 2","pages":"e1012747"},"PeriodicalIF":3.6000,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11841894/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLoS Computational Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1371/journal.pcbi.1012747","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

High-dimensional data have become ubiquitous in the biological sciences, and it is often desirable to compare two datasets collected under different experimental conditions to extract low-dimensional patterns enriched in one condition. However, traditional dimensionality reduction techniques cannot accomplish this because they operate on only one dataset. Contrastive principal component analysis (cPCA) has been proposed to address this problem, but it has seen little adoption because it requires tuning a hyperparameter resulting in multiple solutions, with no way of knowing which is correct. Moreover, cPCA uses foreground and background conditions that are treated differently, making it ill-suited to compare two experimental conditions symmetrically. Here we describe the development of generalized contrastive PCA (gcPCA), a flexible hyperparameter-free approach that solves these problems. We first provide analyses explaining why cPCA requires a hyperparameter and how gcPCA avoids this requirement. We then describe an open-source gcPCA toolbox containing Python and MATLAB implementations of several variants of gcPCA tailored for different scenarios. Finally, we demonstrate the utility of gcPCA in analyzing diverse high-dimensional biological data, revealing unsupervised detection of hippocampal replay in neurophysiological recordings and heterogeneity of type II diabetes in single-cell RNA sequencing data. As a fast, robust, and easy-to-use comparison method, gcPCA provides a valuable resource facilitating the analysis of diverse high-dimensional datasets to gain new insights into complex biological phenomena.

查看原文本刊更多论文

用广义对比PCA识别高维数据集之间的不同模式。

高维数据在生物科学中已经变得无处不在，通常需要比较在不同实验条件下收集的两个数据集，以提取在一个条件下丰富的低维模式。然而，传统的降维技术不能做到这一点，因为它们只对一个数据集进行操作。对比主成分分析（cPCA）已经被提出来解决这个问题，但它很少被采用，因为它需要调优一个超参数，从而产生多个解决方案，而无法知道哪个是正确的。此外，cPCA使用不同处理的前景和背景条件，使得它不适合对称比较两个实验条件。在这里，我们描述了广义对比PCA （gcPCA）的发展，这是一种解决这些问题的灵活的超参数无方法。我们首先提供分析，解释为什么cPCA需要一个超参数，以及gcPCA如何避免这个要求。然后，我们描述了一个开源的gcPCA工具箱，其中包含针对不同场景定制的几种gcPCA变体的Python和MATLAB实现。最后，我们展示了gcPCA在分析各种高维生物学数据中的效用，揭示了神经生理学记录中海马回放的无监督检测和单细胞RNA测序数据中II型糖尿病的异质性。gcPCA作为一种快速、鲁棒且易于使用的比较方法，为分析各种高维数据集提供了宝贵的资源，从而对复杂的生物现象有了新的认识。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

PLoS Computational Biology BIOCHEMICAL RESEARCH METHODS-MATHEMATICAL & COMPUTATIONAL BIOLOGY

CiteScore

7.10

自引率

4.70%

发文量

820

审稿时长

2.5 months

期刊介绍： PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales—from molecules and cells, to patient populations and ecosystems—through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.