Lei Huang, Lei Xiong, Na Sun, Zunpeng Liu, Ka-Chun Wong, Manolis Kellis
A versatile informative diffusion model for single-cell ATAC-seq data generation and analysis
arXiv - QuanBio - Genomics, 2024-08-27, DOI: arxiv-2408.14801
Abstract
The rapid advancement of single-cell ATAC sequencing (scATAC-seq)
technologies holds great promise for investigating the heterogeneity of
epigenetic landscapes at the cellular level. The amplification process in
scATAC-seq experiments often introduces noise due to dropout events, which
results in extreme sparsity that hinders accurate analysis. Consequently, there
is a significant demand for the generation of high-quality scATAC-seq data in
silico. Furthermore, current methodologies are typically task-specific, lacking
a versatile framework capable of handling multiple tasks within a single model.
In this work, we propose ATAC-Diff, a versatile framework based on a latent
diffusion model that is conditioned on latent auxiliary variables so that it
can adapt to various tasks. ATAC-Diff is the first diffusion model for
scATAC-seq data generation and analysis. It comprises auxiliary modules that
encode latent high-level variables, enabling the model to learn the semantic
information needed to sample high-quality data. With a Gaussian Mixture Model
(GMM) as the latent prior and an auxiliary decoder, the resulting latent
variables retain refined genomic information that is beneficial for downstream
analyses. Another innovation is the incorporation of mutual information between
observed and hidden variables as a regularization term, which prevents the
model from decoupling from the latent variables. Through
extensive experiments, we demonstrate that ATAC-Diff achieves high performance
in both generation and analysis tasks, outperforming state-of-the-art models.
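
The abstract names two ingredients without giving formulas: a GMM prior over
the latent variables and a diffusion process in latent space. The sketch below
illustrates both in their standard textbook forms, assuming a diagonal-covariance
GMM and a DDPM-style forward process with a linear beta schedule. It is not the
ATAC-Diff implementation; all function names, the two-component toy prior, and
the schedule values are hypothetical choices made for illustration.

```python
import math
import random


def log_gaussian_diag(z, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )


def gmm_log_prior(z, weights, means, variances):
    """Return (log p(z), responsibilities) under a diagonal GMM prior."""
    joint = [
        math.log(w) + log_gaussian_diag(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp over components, shifted by the maximum for stability.
    top = max(joint)
    log_pz = top + math.log(sum(math.exp(j - top) for j in joint))
    # Posterior probability of each component given z (soft cluster assignment).
    resp = [math.exp(j - log_pz) for j in joint]
    return log_pz, resp


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(z0, alpha_bar, rng):
    """Sample z_t ~ N(sqrt(alpha_bar) * z0, (1 - alpha_bar) * I)."""
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in z0]


# Toy setup (assumed): a 2-D latent space with two equally weighted components.
rng = random.Random(0)
weights = [0.5, 0.5]
means = [[-2.0, 0.0], [2.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

z0 = [1.8, 0.1]  # hypothetical latent code of one cell
log_pz, resp = gmm_log_prior(z0, weights, means, variances)
alpha_bars = linear_alpha_bars(1000)
z_t = forward_noise(z0, alpha_bars[-1], rng)  # almost pure noise at the last step
```

In this toy setting the responsibilities act as soft cluster assignments, which
is one way a GMM prior can make latent variables directly usable for downstream
analyses such as clustering. How ATAC-Diff conditions the diffusion model on
these variables, and how its mutual information term is computed, is not
specified in the abstract and is not modeled here.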