Learning Causal Representations of Single Cells via Sparse Mechanism Shift Modeling

CLEaR. Pub Date: 2022-11-07. DOI: 10.48550/arXiv.2211.03553

Romain Lopez, Nataša Tagasovska, Stephen Ra, K. Cho, J. Pritchard, A. Regev


Citations: 13

Abstract

Latent variable models such as the Variational Auto-Encoder (VAE) have become a go-to tool for analyzing biological data, especially in single-cell genomics. One remaining challenge is interpreting latent variables as the biological processes that define a cell's identity. Outside of biological applications, this problem is commonly referred to as learning disentangled representations. Although several disentanglement-promoting variants of the VAE have been introduced and applied to single-cell genomics data, this task has been shown to be infeasible from independent and identically distributed measurements without additional structure. Instead, recent methods propose to leverage non-stationary data, together with the sparse mechanism shift assumption, to learn disentangled representations with causal semantics. Here, we extend the application of these methodological advances to the analysis of single-cell genomics data with genetic or chemical perturbations. More precisely, we propose a deep generative model of single-cell gene expression data in which each perturbation is treated as a stochastic intervention targeting an unknown, but sparse, subset of latent variables. We benchmark these methods on simulated single-cell data to evaluate their performance at latent unit recovery, causal target identification, and out-of-domain generalization. Finally, we apply those approaches to two real-world large-scale gene perturbation data sets and find that models that exploit the sparse mechanism shift hypothesis surpass contemporary methods on a transfer learning task. We implement our new model and benchmarks using the scvi-tools library, and release them as open-source software at https://github.com/Genentech/sVAE.
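To make the modeling assumption concrete, the sketch below simulates data in the spirit of the generative process described above: each perturbation shifts a small, randomly chosen subset of latent variables, and counts are drawn from the decoded latents. It is only an illustration under stated assumptions, not the authors' sVAE implementation. The function name `simulate_sparse_shift`, the mean-shift form of the intervention, the fixed random linear decoder, the Poisson likelihood, and all default sizes are hypothetical choices made here.

```python
import numpy as np


def simulate_sparse_shift(n_cells=200, n_latent=5, n_genes=20,
                          n_perturbations=3, targets_per_perturbation=1,
                          library_size=1000, seed=0):
    """Toy simulation of sparse mechanism shifts (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    # Binary target matrix: row k marks which latent units perturbation k affects.
    # Row 0 is the unperturbed control and targets nothing.
    targets = np.zeros((n_perturbations + 1, n_latent), dtype=bool)
    for k in range(1, n_perturbations + 1):
        idx = rng.choice(n_latent, size=targets_per_perturbation, replace=False)
        targets[k, idx] = True

    # Assumption: an intervention shifts the mean of each targeted latent unit;
    # untargeted units keep their standard normal prior.
    shifts = rng.normal(0.0, 3.0, size=targets.shape) * targets

    # Each cell receives one perturbation label (0 = control).
    labels = rng.integers(0, n_perturbations + 1, size=n_cells)

    # The latent stays random under intervention, so the intervention is stochastic.
    z = rng.normal(size=(n_cells, n_latent)) + shifts[labels]

    # Assumption: a fixed random linear decoder followed by a softmax
    # gives per-gene expression proportions for each cell.
    weights = rng.normal(size=(n_latent, n_genes))
    logits = z @ weights
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    proportions = np.exp(logits)
    proportions /= proportions.sum(axis=1, keepdims=True)

    # Assumption: Poisson counts scaled by a constant library size.
    counts = rng.poisson(library_size * proportions)
    return counts, labels, targets, z


counts, labels, targets, z = simulate_sparse_shift()
print(counts.shape, targets.sum(axis=1))
```

In a learning setting the `targets` matrix would be unknown and would have to be inferred under a sparsity constraint; here it is sampled up front only so that the simulated data has a known ground truth.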
