Deep structural clustering reveals hidden systematic biases in RNA sequencing data

IF 5.5 · Region 2 (Biology) · JCR Q1 (BIOCHEMISTRY & MOLECULAR BIOLOGY)

Genome Research · Pub Date: 2025-09-19 · DOI: 10.1101/gr.280713.125

Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian


Citations: 0

Abstract

RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, providing comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data is susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an innovative unsupervised Variational Autoencoder-Gaussian Mixture Model (VAE-GMM). The VAE-GMM effectively captures intricate high-dimensional k-mer structural similarities by learning compact latent representations, which reduces dimensionality while meticulously preserving essential structural features crucial for bias identification. This sophisticated modeling allows precise tracking of local RNA-read conversion dynamics and the identification of complex, often overlooked, bias sources. We rigorously validate the VAE-GMM model's performance and robustness against conventional machine learning techniques, including Gaussian Mixture Models (GMM-only), Principal Component Analysis-based GMMs, k-means clustering, and Hierarchical Clustering. These validations, using an extensive and diverse array of datasets including synthetic RNA constructs, various human cell lines, and authentic tissue samples, consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, strongly reinforcing the critical role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer valuable insights into the underlying mechanisms of RNA structure-mediated sequencing bias. This deeper understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.
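One of the conventional baselines named in the abstract, k-means clustering over k-mer features, can be illustrated with a minimal sketch. This is not the authors' VAE-GMM and does not reproduce their pipeline: the toy sequences, the use of plain dinucleotide frequencies as a stand-in for structural features, the farthest-point initialisation, and all function names (`kmer_profile`, `farthest_point_init`, `kmeans`) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"


def kmer_profile(seq, k=2):
    """Normalised k-mer frequency vector of an RNA sequence (4**k entries)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, n_clusters):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_point_init(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy input: three GC-rich and three AU-rich fragments.
sequences = [
    "GCGCGGCCGCGC", "GGCCGCGCGGCC", "CCGGCGCGCCGG",
    "AUAUAAUUAUAU", "UUAAUAUAUUAA", "AAUUAUAUAAUU",
]
profiles = [kmer_profile(s, k=2) for s in sequences]
labels = kmeans(profiles, n_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On this toy input the two composition classes separate cleanly. Sequence composition alone is a far cruder feature than the high-dimensional structural similarities the abstract describes, which is the gap the learned latent representation is meant to close.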

Source journal

Genome Research (Biology: Biochemistry & Molecular Biology)

CiteScore: 12.40

Self-citation rate: 1.40%

Articles published: 140

Review time: 6 months

About the journal: Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine. Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies. New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal's website where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.