Correcting scale distortion in RNA sequencing data.

IF 2.9 3区生物学 Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics Pub Date : 2025-01-28 DOI:10.1186/s12859-025-06041-3

Christopher Thron, Farhad Jafari

{"title":"Correcting scale distortion in RNA sequencing data.","authors":"Christopher Thron, Farhad Jafari","doi":"10.1186/s12859-025-06041-3","DOIUrl":null,"url":null,"abstract":"<p><p>RNA sequencing (RNA-seq) is the conventional genome-scale approach used to capture the expression levels of all detectable genes in a biological sample. This is now regularly used for population-based studies designed to identify genetic determinants of various diseases. Naturally, the accuracy of these tests should be verified and improved if possible. In this study, we aimed to detect and correct for expression level-dependent errors which are not corrected by conventional normalization techniques. We examined several RNA-seq datasets from the Cancer Genome Atlas (TCGA), Stand Up 2 Cancer (SU2C), and GTEx databases with various types of preprocessing. By applying local averaging, we found expression-level dependent biases that differ from sample to sample in all datasets studied. Using simulations, we show that these biases corrupt gene-gene correlation estimations and t tests between subpopulations. To mitigate these biases, we introduce two different nonlinear transforms based on statistical considerations that correct these observed biases. We demonstrate that these transforms effectively remove the observed per-sample biases, reduce sample-to-sample variance, and improve the characteristics of gene-gene correlation distributions. Using a novel simulation methodology that creates controlled differences between subpopulations, we show that these transforms reduce variability and increase sensitivity of two population tests. The improvements in sensitivity and specificity were of the order of 3-5% in most instances after the data was corrected for bias. Altogether, these results improve our capacity to understand gene-gene relationships, and may lead to novel ways to utilize the information derived from clinical tests.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"32"},"PeriodicalIF":2.9000,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11776150/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12859-025-06041-3","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

RNA sequencing (RNA-seq) is the conventional genome-scale approach used to capture the expression levels of all detectable genes in a biological sample. This is now regularly used for population-based studies designed to identify genetic determinants of various diseases. Naturally, the accuracy of these tests should be verified and improved if possible. In this study, we aimed to detect and correct for expression level-dependent errors which are not corrected by conventional normalization techniques. We examined several RNA-seq datasets from the Cancer Genome Atlas (TCGA), Stand Up 2 Cancer (SU2C), and GTEx databases with various types of preprocessing. By applying local averaging, we found expression-level dependent biases that differ from sample to sample in all datasets studied. Using simulations, we show that these biases corrupt gene-gene correlation estimations and t tests between subpopulations. To mitigate these biases, we introduce two different nonlinear transforms based on statistical considerations that correct these observed biases. We demonstrate that these transforms effectively remove the observed per-sample biases, reduce sample-to-sample variance, and improve the characteristics of gene-gene correlation distributions. Using a novel simulation methodology that creates controlled differences between subpopulations, we show that these transforms reduce variability and increase sensitivity of two population tests. The improvements in sensitivity and specificity were of the order of 3-5% in most instances after the data was corrected for bias. Altogether, these results improve our capacity to understand gene-gene relationships, and may lead to novel ways to utilize the information derived from clinical tests.

查看原文本刊更多论文

校正RNA测序数据的尺度畸变。

RNA测序（RNA-seq）是传统的基因组尺度方法，用于捕获生物样本中所有可检测基因的表达水平。这种方法现在经常用于以人群为基础的研究，旨在确定各种疾病的遗传决定因素。当然，如果可能的话，这些测试的准确性应该得到验证和改进。在这项研究中，我们的目的是检测和纠正表达水平相关的错误，这些错误不能通过传统的规范化技术来纠正。我们检查了来自癌症基因组图谱（TCGA）、Stand Up 2 Cancer （SU2C）和GTEx数据库的RNA-seq数据集，并进行了不同类型的预处理。通过应用局部平均，我们发现在所有研究的数据集中，不同样本的表达水平依赖偏差不同。通过模拟，我们发现这些偏差破坏了亚种群之间的基因-基因相关性估计和t检验。为了减轻这些偏差，我们基于统计考虑引入了两种不同的非线性变换来纠正这些观察到的偏差。我们证明，这些变换有效地消除了观察到的每样本偏差，减少了样本间方差，并改善了基因-基因相关分布的特征。使用一种新的模拟方法，在亚种群之间创建受控差异，我们表明这些转换减少了变异并增加了两个种群测试的敏感性。在大多数情况下，在纠正数据偏差后，敏感性和特异性的改善约为3-5%。总之，这些结果提高了我们理解基因-基因关系的能力，并可能导致利用来自临床试验的信息的新方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

BMC Bioinformatics 生物-生化研究方法

CiteScore

5.70

自引率

3.30%

发文量

506

审稿时长

4.3 months

期刊介绍： BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.