A versatile informative diffusion model for single-cell ATAC-seq data generation and analysis

Lei Huang, Lei Xiong, Na Sun, Zunpeng Liu, Ka-Chun Wong, Manolis Kellis
arXiv - QuanBio - Genomics. Published 2024-08-27. DOI: arxiv-2408.14801 (https://doi.org/arxiv-2408.14801)

Abstract

The rapid advancement of single-cell ATAC sequencing (scATAC-seq) technologies holds great promise for investigating the heterogeneity of epigenetic landscapes at the cellular level. The amplification process in scATAC-seq experiments often introduces noise due to dropout events, which results in extreme sparsity that hinders accurate analysis. Consequently, there is significant demand for generating high-quality scATAC-seq data in silico. Furthermore, current methods are typically task-specific and lack a versatile framework capable of handling multiple tasks within a single model. In this work, we propose ATAC-Diff, a versatile framework based on a latent diffusion model that is conditioned on latent auxiliary variables so that it can adapt to various tasks. ATAC-Diff is the first diffusion model for scATAC-seq data generation and analysis. It includes auxiliary modules that encode latent high-level variables, enabling the model to learn semantic information and sample high-quality data. With a Gaussian Mixture Model (GMM) as the latent prior and an auxiliary decoder, the resulting variables retain refined genomic information that is beneficial for downstream analyses. Another innovation is the incorporation of mutual information between observed and hidden variables as a regularization term, which prevents the model from decoupling from the latent variables. Through extensive experiments, we demonstrate that ATAC-Diff achieves high performance in both generation and analysis tasks, outperforming state-of-the-art models.
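Two of the ingredients named in the abstract, diffusion in a latent space and a Gaussian mixture prior over the auxiliary variables, can be illustrated with a small generic sketch. This is not the ATAC-Diff implementation: the function names, the linear noise schedule, the diagonal covariances, and all dimensions below are assumptions chosen for illustration, and the forward step is the standard DDPM noising formula rather than anything taken from the paper.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (a common DDPM default, assumed here)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(z0, t, alphas_cumprod, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(a_bar_t) * z0, (1 - a_bar_t) * I)."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps
    return zt, eps

def gmm_log_prob(c, weights, means, variances):
    """Log density of a vector c under a diagonal-covariance Gaussian mixture.

    weights: (K,), means: (K, D), variances: (K, D).
    """
    diff = c[None, :] - means
    log_comp = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=1
    )
    a = np.log(weights) + log_comp
    m = a.max()  # log-sum-exp shift for numerical stability
    return m + np.log(np.sum(np.exp(a - m)))

# Usage: noise an 8-dimensional latent at step 50 of 100, and score a
# 2-dimensional auxiliary variable under a two-component mixture.
rng = np.random.default_rng(0)
alphas_cumprod = np.cumprod(1.0 - linear_beta_schedule(100))
z_t, eps = forward_noise(np.ones(8), 50, alphas_cumprod, rng)
log_p = gmm_log_prob(
    np.zeros(2),
    weights=np.array([0.5, 0.5]),
    means=np.array([[0.0, 0.0], [3.0, 3.0]]),
    variances=np.ones((2, 2)),
)
```

In a conditional latent diffusion model, a denoising network would be trained to predict `eps` from `z_t`, the step index, and the conditioning variable. The mixture prior gives the conditioning space a cluster structure, which is the kind of property the abstract describes as useful for downstream analysis.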