No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data.

Impact factor 0.8; Zone 4 (Mathematics); JCR Q4 (Biochemistry & Molecular Biology)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-04-25 DOI:10.1515/sagmb-2017-0010

Aaron T L Lun, Gordon K Smyth

Volume 16, issue 2, pages 83-93. Publication type: Journal Article.

Citations: 15

Abstract

RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.
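The d.f. problem described in the abstract can be illustrated for the simplest case, a one-way layout in which each sample belongs to exactly one treatment group. The sketch below is not the formula from the article, which is not given in the abstract; it assumes that a group whose counts are all zero contributes neither observations nor an estimable coefficient to the residual d.f. The function name `residual_df_one_way` and its arguments are hypothetical.

```python
from collections import defaultdict


def residual_df_one_way(counts, groups):
    """Return (standard_df, reduced_df) for one gene in a one-way layout.

    Assumption (illustrative only): samples in an all-zero group have fitted
    values of exactly zero, so they and their group coefficient are dropped
    when counting residual degrees of freedom.
    """
    by_group = defaultdict(list)
    for count, group in zip(counts, groups):
        by_group[group].append(count)

    n_samples = len(counts)
    n_groups = len(by_group)

    # Standard formula: number of observations minus number of coefficients.
    standard_df = n_samples - n_groups

    # Groups in which every count is zero have fitted values of exactly zero.
    zero_groups = [g for g, vals in by_group.items() if all(v == 0 for v in vals)]
    n_zero_samples = sum(len(by_group[g]) for g in zero_groups)

    # Drop the zero-fitted samples and the coefficients of their groups.
    reduced_df = (n_samples - n_zero_samples) - (n_groups - len(zero_groups))
    return standard_df, reduced_df


# Three replicates per group; group A has no counts at all for this gene.
counts = [0, 0, 0, 12, 7, 9]
groups = ["A", "A", "A", "B", "B", "B"]
print(residual_df_one_way(counts, groups))  # (4, 2)
```

In this example the standard formula reports 4 residual d.f., but only the three samples in group B carry information about variability, leaving 2. Treating the variance estimate as if it had 4 d.f. is the kind of overstatement the abstract describes.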




Source journal

Statistical Applications in Genetics and Molecular Biology (Biochemistry & Molecular Biology; Mathematical & Computational Biology)

Self-citation rate

11.10%

Articles published

Journal description: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. Papers should focus on the relevant statistical issues but should also contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles will be warmly received.