Generative and integrative modeling for transcriptomics with formalin-fixed paraffin-embedded material.

IF 7.5 | Zone 2, Medicine | JCR Q1, MEDICINE, RESEARCH & EXPERIMENTAL

Journal of Translational Medicine Pub Date : 2025-09-30 DOI:10.1186/s12967-025-07031-y

Eliseos J Mucaki, Wenhan Zhang, Aryamaan Saha, Sabina Trebinjac, Sharon Nofech-Mozes, Eileen Rakovitch, Vanessa Dumeaux, Michael T Hallett

Journal of Translational Medicine, vol. 23(1), 1023. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12486589/pdf/

Citations: 0

Abstract


Generative and integrative modeling for transcriptomics with formalin fixed paraffin embedded material.

Background: Formalin-fixed paraffin-embedded (FFPE) samples suffer from the degradation of nucleic acids, a problem that becomes particularly acute with samples stored for extended periods. It remains challenging to profile FFPE samples using high-throughput sequencing technologies including RNA sequencing, and the resulting FFPE RNA-seq (fRNA-seq) data has a high rate of transcript dropout, a property shared with single-cell RNA-seq. Transcript counts also have high variance and are prone to extreme values, which together make downstream analyses extremely challenging.

Methods: We introduce the PaRaffin Embedded Formalin-FixEd Cleaning Tool (PREFFECT), a probabilistic framework for the analysis of fRNA-seq data. PREFFECT uses generative models to fit distributions to observed expression counts while adjusting for technical and biological variables. The framework can exploit multiple expression profiles generated from matched tissues for a single sample (e.g., a tumor and morphologically normal tissue) in order to stabilize profiles and impute missing counts. PREFFECT can also leverage sample-sample adjacency networks that assist graph attention mechanisms to identify the most informative correlations in the data.
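The abstract does not say how the sample-sample adjacency networks are constructed. As a minimal sketch of one common choice, assumed here and not taken from the paper, each sample can be linked to its k most correlated samples using Pearson correlation on log-transformed counts. The function names, the toy profiles, and the choice of k are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def knn_adjacency(profiles, k):
    """Symmetric 0/1 adjacency linking each sample to its k most correlated samples.

    Assumption: similarity is Pearson correlation on log1p-transformed counts.
    """
    logged = [[math.log1p(c) for c in p] for p in profiles]
    n = len(logged)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((pearson(logged[i], logged[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # symmetrize: an edge exists if either side picks the other
    return adj


# Toy count profiles (rows are samples, columns are transcripts); hypothetical data.
profiles = [
    [10, 0, 5, 40],
    [12, 1, 4, 38],
    [0, 30, 22, 1],
    [1, 28, 25, 0],
]
adjacency = knn_adjacency(profiles, k=1)
for row in adjacency:
    print(row)
```

A matrix like this could then serve as the edge structure over which a graph attention layer weights neighboring samples; how PREFFECT actually does this is not specified in the abstract.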

Results: We evaluated the distribution of transcript counts across a compendium of fRNA-seq datasets, finding that the negative binomial distribution fits the data best, with little evidence supporting zero-inflated extensions. We use this knowledge in the design of PREFFECT. We show that PREFFECT can accurately impute missing values from fRNA-seq count matrices and adjust for batch effects. The inclusion of sample-sample adjacency networks and multiple tissues was shown to enhance sample clustering.
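One simple way to look for zero inflation, sketched here under stated assumptions and not taken from the paper, is to fit a negative binomial to a single transcript's counts by the method of moments and compare the observed fraction of zeros with the fraction the fitted distribution implies. The function names and toy counts are hypothetical, and the abstract does not describe the procedure the authors used.

```python
from statistics import mean, pvariance


def fit_nb_moments(counts):
    """Method-of-moments negative binomial fit.

    Returns (mu, r) under the parameterization var = mu + mu**2 / r.
    """
    mu = mean(counts)
    var = pvariance(counts, mu)
    if var <= mu:
        raise ValueError("variance <= mean: no overdispersion, moments fit undefined")
    r = mu * mu / (var - mu)
    return mu, r


def nb_zero_probability(mu, r):
    """P(X = 0) for a negative binomial with mean mu and dispersion r."""
    return (r / (r + mu)) ** r


def zero_fractions(counts):
    """Observed zero fraction and the zero fraction implied by the NB fit."""
    mu, r = fit_nb_moments(counts)
    observed = sum(1 for c in counts if c == 0) / len(counts)
    expected = nb_zero_probability(mu, r)
    return observed, expected


# Toy counts for one transcript across 12 samples; hypothetical data.
counts = [0, 0, 3, 1, 0, 12, 7, 0, 2, 25, 0, 4]
mu, r = fit_nb_moments(counts)
observed, expected = zero_fractions(counts)
print(f"mu={mu:.2f} r={r:.3f} observed_zeros={observed:.3f} nb_zeros={expected:.3f}")
```

An observed zero fraction far above the NB-implied one would hint at zero inflation, while close agreement is consistent with a plain negative binomial. A formal comparison would use likelihood-based criteria across many transcripts.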

Conclusions: The vast majority of studies to date contain at most a few hundred profiles, making it challenging to infer good statistical fits for each transcript, especially in complex cohorts, given the noisy, incomplete and heterogeneous nature of the data. The integrative and generative approach of PREFFECT provides better and more specific model fits than generic bulk RNA-seq tools, especially when matched profiles are included in the analysis with the more advanced PREFFECT models. The transformed data can be used directly with many well-established tools for downstream analysis tasks, empowering its use in clinical biomarker studies and diagnostics.


Source journal

Journal of Translational Medicine (Medicine: Research & Experimental)

CiteScore: 10.00

Self-citation rate: 1.40%

Articles published: 537

Review time: 1 month

About the journal: The Journal of Translational Medicine is an open-access journal that publishes articles focusing on information derived from human experimentation to enhance communication between basic and clinical science. It covers all areas of translational medicine.