Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian
{"title":"深层结构聚类揭示了RNA测序数据中隐藏的系统性偏差","authors":"Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian","doi":"10.1101/gr.280713.125","DOIUrl":null,"url":null,"abstract":"RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, providing comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data is susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an innovative unsupervised Variational Autoencoder-Gaussian Mixture Model (VAE-GMM). The VAE-GMM effectively captures intricate high-dimensional <em>k</em>-mer structural similarities by learning compact latent representations, which reduces dimensionality while meticulously preserving essential structural features crucial for bias identification. This sophisticated modeling allows precise tracking of local RNA-read conversion dynamics and the identification of complex, often overlooked, bias sources. We rigorously validate the VAE-GMM model's performance and robustness against conventional machine learning techniques, including Gaussian Mixture Models (GMM-only), Principal Component Analysis-based GMMs, <em>k</em>-means clustering, and Hierarchical Clustering. These validations, using an extensive and diverse array of datasets including synthetic RNA constructs, various human cell lines, and authentic tissue samples, consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, strongly reinforcing the critical role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer valuable insights into the underlying mechanisms of RNA structure-mediated sequencing bias. This deeper understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"27 1","pages":""},"PeriodicalIF":5.5000,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep structural clustering reveals hidden systematic biases in RNA sequencing data\",\"authors\":\"Qiang Su, Yi Long, Deming Gou, Junmin Quan, Xiaoming Zhou, Qizhou Lian\",\"doi\":\"10.1101/gr.280713.125\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, providing comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data is susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an innovative unsupervised Variational Autoencoder-Gaussian Mixture Model (VAE-GMM). The VAE-GMM effectively captures intricate high-dimensional <em>k</em>-mer structural similarities by learning compact latent representations, which reduces dimensionality while meticulously preserving essential structural features crucial for bias identification. This sophisticated modeling allows precise tracking of local RNA-read conversion dynamics and the identification of complex, often overlooked, bias sources. We rigorously validate the VAE-GMM model's performance and robustness against conventional machine learning techniques, including Gaussian Mixture Models (GMM-only), Principal Component Analysis-based GMMs, <em>k</em>-means clustering, and Hierarchical Clustering. These validations, using an extensive and diverse array of datasets including synthetic RNA constructs, various human cell lines, and authentic tissue samples, consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, strongly reinforcing the critical role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer valuable insights into the underlying mechanisms of RNA structure-mediated sequencing bias. This deeper understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.\",\"PeriodicalId\":12678,\"journal\":{\"name\":\"Genome research\",\"volume\":\"27 1\",\"pages\":\"\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2025-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Genome research\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1101/gr.280713.125\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"BIOCHEMISTRY & MOLECULAR BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genome research","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1101/gr.280713.125","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
Deep structural clustering reveals hidden systematic biases in RNA sequencing data
RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, providing comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data is susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an innovative unsupervised Variational Autoencoder-Gaussian Mixture Model (VAE-GMM). The VAE-GMM effectively captures intricate high-dimensional k-mer structural similarities by learning compact latent representations, which reduces dimensionality while meticulously preserving essential structural features crucial for bias identification. This sophisticated modeling allows precise tracking of local RNA-read conversion dynamics and the identification of complex, often overlooked, bias sources. We rigorously validate the VAE-GMM model's performance and robustness against conventional machine learning techniques, including Gaussian Mixture Models (GMM-only), Principal Component Analysis-based GMMs, k-means clustering, and Hierarchical Clustering. These validations, using an extensive and diverse array of datasets including synthetic RNA constructs, various human cell lines, and authentic tissue samples, consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, strongly reinforcing the critical role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer valuable insights into the underlying mechanisms of RNA structure-mediated sequencing bias. This deeper understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.
期刊介绍:
Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine.
Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies.
New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal''s web site where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.