Sultan Sevgi Turgut Ögme, Nizamettin Aydin, Zeyneb Kurt
PLoS Computational Biology, 21(10): e1013525. Published 2025-10-06. DOI: 10.1371/journal.pcbi.1013525 (https://doi.org/10.1371/journal.pcbi.1013525). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12500167/pdf/
A mixture of attention experts-embedded flow-based generative model to create synthetic cells in single-cell RNA-Seq datasets.
Single-cell RNA-seq (scRNAseq) analyses, performed at the cellular level, aim to understand the cellular landscape of tissue sections, offer insights into rare cell types, and identify marker genes for annotating distinct cell types. ScRNAseq analyses are widely applied in cancer research to understand tumor heterogeneity, disease progression, and resistance to therapy. Processing single-cell data is challenging because of its high dimensionality, sparsity, and imbalanced class (cell-type) distributions. Accurate cell-type identification depends heavily on preprocessing and quality control steps. To address these issues, generative models have been widely used in recent years. Frequently used techniques include Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Gaussian-based methods, and, more recently, flow-based (FB) generative models. We developed a Masked Affine Autoregressive transform-embedded FB (MAF-FB) model. Then, to improve MAF-FB further, we incorporated a mixture of experts (MOE) of attention mechanisms on top of it, resulting in our proposed MOE-FB model. We conducted a comparative analysis of fundamental generative models, intended as preliminary guidance for developing novel automated scRNAseq data analysis systems. We performed a large-scale analysis by combining four datasets derived from pancreatic tissue sections, and for further generalizability assessments we employed Peripheral Blood Mononuclear Cell (PBMC68K and PBMC3K) and Human Cell Atlas Bone Marrow (HCA-BM10K) datasets. We utilized VAE, GAN, Gaussian Copula, and the Automated Cell-Type-informed Introspective Variational Autoencoder (ACTIVA), and compared them against our two novel FB models, MAF-FB and MOE-FB, for scRNAseq synthesis. To evaluate the performance of the generative models, we used various discrepancy metrics and performed automated cell-type classification tasks. We also identified differentially expressed genes for each cell type and inferred cell-cell interactions based on ligand-receptor binding across distinct cell-type pairs. Among the generative models, the FB models, especially MOE-FB, consistently outperformed the others across all experimental setups, both in discrepancy metrics relative to the baseline test set and in cell-type classification tasks (with an F1-score of 0.90, precision of 0.89, and recall of 0.92 for the integrated pancreatic datasets). MOE-FB produced more biologically relevant synthetic data, and the ligand-receptor-based cell-cell interactions inferred from the synthetic cells closely resembled those of the original data, achieving an RMSE of 0.65 against the corresponding pancreatic test set. These findings highlight the promise of FB models, especially MOE-FB, in scRNAseq analyses.
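For readers unfamiliar with the masked affine autoregressive transform named in the abstract, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a toy linear conditioner with strictly lower-triangular weights in place of a masked neural network; all function names, parameter names, and shapes are hypothetical. The mixture-of-experts attention component of MOE-FB is not sketched here.

```python
import numpy as np


def make_conditioner(dim, rng, scale=0.1):
    """Hypothetical stand-in for a masked network: strictly lower-triangular
    weights, so output i depends only on inputs 0..i-1."""
    mask = np.tril(np.ones((dim, dim)), k=-1)
    return {
        "W_mu": rng.normal(0.0, scale, (dim, dim)) * mask,
        "W_ls": rng.normal(0.0, scale, (dim, dim)) * mask,
        "b_mu": rng.normal(0.0, scale, dim),
        "b_ls": rng.normal(0.0, scale, dim),
    }


def shift_and_log_scale(x, p):
    """Per-feature shift mu_i and log-scale log_sigma_i, each a function of x_<i."""
    mu = x @ p["W_mu"].T + p["b_mu"]
    log_sigma = np.tanh(x @ p["W_ls"].T + p["b_ls"])  # bounded for stability
    return mu, log_sigma


def to_latent(x, p):
    """Data -> latent in one pass: z_i = (x_i - mu_i) * exp(-log_sigma_i)."""
    mu, log_sigma = shift_and_log_scale(x, p)
    z = (x - mu) * np.exp(-log_sigma)
    # The Jacobian dz/dx is lower triangular, so its log-determinant is the
    # sum of the log-diagonal entries, i.e. -sum_i log_sigma_i.
    log_det = -log_sigma.sum(axis=-1)
    return z, log_det


def log_likelihood(x, p):
    """Exact log-density under a standard normal base distribution."""
    z, log_det = to_latent(x, p)
    dim = x.shape[-1]
    base = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * dim * np.log(2.0 * np.pi)
    return base + log_det


def sample(z, p):
    """Latent -> data, one feature at a time:
    x_i = mu_i(x_<i) + exp(log_sigma_i(x_<i)) * z_i."""
    x = np.zeros_like(z)
    for i in range(z.shape[-1]):
        mu, log_sigma = shift_and_log_scale(x, p)
        x[:, i] = mu[:, i] + np.exp(log_sigma[:, i]) * z[:, i]
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = make_conditioner(dim=5, rng=rng)
    z = rng.normal(size=(4, 5))        # latent draws
    x = sample(z, params)              # synthetic "cells" (toy features)
    z_back, _ = to_latent(x, params)   # invert the transform
    print(np.allclose(z, z_back))
```

The design point this illustrates is that the transform is exactly invertible and has a cheap log-determinant, which is what allows flow-based models to be trained by maximizing an exact likelihood and then sampled to produce synthetic data. In practice several such layers would be stacked and the conditioner would be a trained masked network.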
About the journal:
PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales—from molecules and cells, to patient populations and ecosystems—through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery.
Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines.
Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights.
Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology.
Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.