Eliseos J Mucaki, Wenhan Zhang, Aryamaan Saha, Sabina Trebinjac, Sharon Nofech-Mozes, Eileen Rakovitch, Vanessa Dumeaux, Michael T Hallett
Journal of Translational Medicine, vol. 23, no. 1, p. 1023. Published 2025-09-30. DOI: https://doi.org/10.1186/s12967-025-07031-y. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12486589/pdf/
Generative and integrative modeling for transcriptomics with formalin fixed paraffin embedded material.
Background: Formalin-fixed paraffin embedded (FFPE) samples suffer from the degradation of nucleic acids, a problem that becomes particularly acute with samples stored for extended periods. It remains challenging to profile FFPE samples using high-throughput sequencing technologies including RNA-sequencing, and the resulting FFPE RNA-seq (fRNA-seq) data has a high rate of transcript dropout, a property shared with single-cell RNA-seq. Transcript counts also have high variance and are prone to extreme values, which together make downstream analyses extremely challenging.
Methods: We introduce the PaRaffin Embedded Formalin-FixEd Cleaning Tool (PREFFECT), a probabilistic framework for the analysis of fRNA-seq data. PREFFECT uses generative models to fit distributions to observed expression counts while adjusting for technical and biological variables. The framework can exploit multiple expression profiles generated from matched tissues for a single sample (e.g., a tumor and morphologically normal tissue) in order to stabilize profiles and impute missing counts. PREFFECT can also leverage sample-sample adjacency networks that assist graph attention mechanisms to identify the most informative correlations in the data.
Results: We evaluated the distribution of transcript counts across a compendium of fRNA-seq datasets, finding that the negative binomial distribution best fits the data, with little evidence supporting zero-inflated extensions. We use this knowledge in the design of PREFFECT. We show that PREFFECT can accurately impute missing values from fRNA-seq count matrices and adjust for batch effects. The inclusion of sample-sample adjacency networks and multiple tissues was shown to enhance sample clustering.
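The comparison between a plain negative binomial and a zero-inflated extension can be illustrated with a minimal sketch. It is not PREFFECT's implementation: the function names (`nb_logpmf`, `fit_nb_moments`, `nb_zero_fraction`, `sample_nb`), the method-of-moments fit, and the simulated counts are all assumptions made for illustration. The idea is to fit a negative binomial to one transcript's counts and check whether the observed fraction of zeros exceeds what the fitted model already predicts; if it does not, a zero-inflation term has little to explain.

```python
import math
import random


def nb_logpmf(k, mean, r):
    """Log-probability of count k under a negative binomial with the given
    mean and finite inverse-dispersion r (variance = mean + mean**2 / r)."""
    p = r / (r + mean)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))


def fit_nb_moments(counts):
    """Method-of-moments estimate of (mean, r). Returns r = inf when the
    sample shows no overdispersion (variance <= mean, i.e. Poisson-like)."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        return mean, float("inf")
    return mean, mean * mean / (var - mean)


def nb_zero_fraction(mean, r):
    """Probability of a zero count implied by the fitted model."""
    if math.isinf(r):
        return math.exp(-mean)  # Poisson limit
    return (r / (r + mean)) ** r


def sample_nb(mean, r, rng):
    """Draw one negative binomial count as a gamma-Poisson mixture.
    The Poisson draw uses Knuth's method, adequate for small rates only."""
    lam = rng.gammavariate(r, mean / r)
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated counts for one hypothetical transcript (not real fRNA-seq data).
    counts = [sample_nb(5.0, 2.0, rng) for _ in range(5000)]
    mean, r = fit_nb_moments(counts)
    observed = sum(c == 0 for c in counts) / len(counts)
    print(f"fitted mean={mean:.2f}, r={r:.2f}")
    print(f"observed zero fraction={observed:.3f}, "
          f"NB-implied zero fraction={nb_zero_fraction(mean, r):.3f}")
```

When the observed and model-implied zero fractions agree, as they should for counts simulated from a negative binomial, the dropout is already accounted for by overdispersion. A formal analysis would compare maximum-likelihood fits of both models (for example with a likelihood-ratio test or an information criterion) rather than rely on moment estimates.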
Conclusions: The vast majority of studies to date contain at most a few hundred profiles, making it challenging to infer good statistical fits for each transcript, especially in complex cohorts, given the noisy, incomplete and heterogeneous nature of the data. The integrative and generative approach of PREFFECT provides better and more specific model fits than generic bulk RNA-seq tools, especially when the more advanced PREFFECT models that incorporate matched profiles are included in the analysis. The transformed data can be used directly with many well-established tools for downstream analysis tasks, empowering its use in clinical biomarker studies and diagnostics.
About the journal:
The Journal of Translational Medicine is an open-access journal that publishes articles focusing on information derived from human experimentation to enhance communication between basic and clinical science. It covers all areas of translational medicine.