Integration of Bulk RNA-seq Pipeline Metrics for Assessing Low-Quality Samples.

Research Square. Pub Date: 2025-07-03. DOI: 10.21203/rs.3.rs-6976695/v1

Samuel Hamilton, Gaurav Gadhvi, Tyler Therron, Deborah R Winter


Abstract

Background: With the rise of RNA-seq as an essential and ubiquitous tool for biomedical research, the need for guidelines on quality control (QC) is pressing. Specifically, there remains limited data as to which technical metrics are most informative in identifying low-quality samples.

Results: Here, we addressed this issue by developing the Quality Control Diagnostic Renderer (QC-DR), software designed to simultaneously visualize a comprehensive panel of QC metrics generated by an RNA-seq pipeline and to flag samples with aberrant values when compared to a reference dataset. As an example, we applied QC-DR to the Successful Clinical Response in Pneumonia Therapy (SCRIPT) dataset, a large clinical RNA-seq dataset of sequenced alveolar macrophages (n = 252). Next, we used this dataset to assess relationships between a variety of QC metrics and sample quality. The pipeline QC metrics most highly correlated with sample quality were % and # Uniquely Aligned Reads, % rRNA Reads, # Detected Genes, and our newly developed metric, Area Under the Gene Body Coverage Curve (AUC-GBC), while experimental QC metrics derived from the lab were not significantly correlated. We then trained a set of machine learning models on the SCRIPT dataset to evaluate the relative contribution of QC metrics to sample quality prediction. Our model performs well when tested on an independent dataset despite differences in the distribution of QC metrics.

Conclusions: Our results support the conclusion that any individual QC metric is limited in its predictive value and suggest approaches based on the integration of multiple metrics with QC thresholds. In summary, our work provides new insights, practical guidance, and novel QC software that can be used to improve the methodological rigor of RNA-seq studies.
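The idea of integrating several metrics against a reference dataset can be illustrated with a minimal Python sketch. This is not the QC-DR implementation, and the abstract does not specify how AUC-GBC or the flagging thresholds are computed. The function names, the peak-normalized trapezoidal area, the z-score cutoff of 2.0, and the "at least two flagged metrics" rule are all assumptions made for illustration only.

```python
from statistics import mean, stdev


def auc_gene_body_coverage(coverage):
    """Trapezoidal area under a gene body coverage curve.

    Assumption: `coverage` holds read coverage at evenly spaced positions
    from the 5' to the 3' end of the gene body; the curve is scaled so its
    peak is 1 and the x-axis spans [0, 1].
    """
    peak = max(coverage)
    if peak <= 0 or len(coverage) < 2:
        return 0.0
    scaled = [c / peak for c in coverage]
    step = 1.0 / (len(scaled) - 1)
    return sum((scaled[i] + scaled[i + 1]) / 2 * step
               for i in range(len(scaled) - 1))


def flag_metrics(sample, reference, z_cutoff=2.0):
    """Return names of metrics whose value is aberrant versus the reference.

    Assumption: "aberrant" means more than `z_cutoff` sample standard
    deviations away from the reference mean.
    """
    flagged = []
    for name, value in sample.items():
        ref_values = reference[name]
        mu, sd = mean(ref_values), stdev(ref_values)
        if sd > 0 and abs(value - mu) / sd > z_cutoff:
            flagged.append(name)
    return flagged


def is_low_quality(sample, reference, z_cutoff=2.0, min_flags=2):
    """Combine metrics: call a sample low quality only if several are flagged."""
    return len(flag_metrics(sample, reference, z_cutoff)) >= min_flags


# Hypothetical reference values and one hypothetical sample.
reference = {
    "pct_unique": [80, 82, 84, 86, 88],
    "detected_genes": [14000, 15000, 16000, 15500, 14500],
    "pct_rrna": [1, 2, 3, 2, 2],
}
sample = {"pct_unique": 60, "detected_genes": 15100, "pct_rrna": 10}

flags = flag_metrics(sample, reference)
low_quality = is_low_quality(sample, reference)
```

In this toy example the sample is flagged on two of its three metrics and is therefore called low quality. Requiring more than one flagged metric reflects the abstract's point that any single metric has limited predictive value.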
