Estimators of the local false discovery rate designed for small numbers of tests.

IF 0.4 · Zone 4 (Mathematics) · Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology, Vol. 11, No. 5 · Pub Date: 2012-10-12 · DOI: 10.1515/1544-6115.1807

Marta Padilla, David R Bickel


Citations: 5

Abstract


Estimators of the local false discovery rate designed for small numbers of tests.

Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer. A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (dalt), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively. 
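The idea of excluding each feature when estimating the parameters of the prior can be illustrated with a small sketch. The abstract does not state the data model, so everything below is an assumption made for illustration: each feature has a test statistic drawn from a two-component mixture, N(0, 1) with probability p0 (unaffected) and N(d_alt, 1) otherwise; the MLE of (p0, d_alt) is found by a coarse grid search; and the correction is read as a plain leave-one-out refit. The function names and grids are hypothetical and are not the authors' implementation.

```python
import math

def normal_pdf(x, mean=0.0):
    """Density of a unit-variance normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def lfdr(z, p0, d_alt):
    """Posterior probability that the null hypothesis is true given z."""
    null_part = p0 * normal_pdf(z)
    alt_part = (1.0 - p0) * normal_pdf(z, d_alt)
    return null_part / (null_part + alt_part)

def log_likelihood(zs, p0, d_alt):
    """Log likelihood of the assumed two-component mixture."""
    return sum(
        math.log(p0 * normal_pdf(z) + (1.0 - p0) * normal_pdf(z, d_alt))
        for z in zs
    )

# Hypothetical search grids (an assumption, chosen only to keep the sketch short).
P0_GRID = [i / 20 for i in range(21)]      # 0.00, 0.05, ..., 1.00
D_GRID = [0.5 * j for j in range(1, 13)]   # 0.5, 1.0, ..., 6.0

def fit_mle(zs):
    """Grid-search maximum likelihood estimate of (p0, d_alt)."""
    candidates = ((p0, d) for p0 in P0_GRID for d in D_GRID)
    return max(candidates, key=lambda pd: log_likelihood(zs, pd[0], pd[1]))

def mle_lfdr(zs):
    """Uncorrected estimate: fit on all features, then plug in."""
    p0, d_alt = fit_mle(zs)
    return [lfdr(z, p0, d_alt) for z in zs]

def leave_one_out_lfdr(zs):
    """Corrected estimate: feature i is excluded when fitting its own prior."""
    estimates = []
    for i, z in enumerate(zs):
        others = zs[:i] + zs[i + 1:]
        p0, d_alt = fit_mle(others)
        estimates.append(lfdr(z, p0, d_alt))
    return estimates

# Made-up statistics for six features, not data from the paper.
zs = [0.1, -0.4, 3.2, 0.7, 2.9, -0.2]
plain = mle_lfdr(zs)
corrected = leave_one_out_lfdr(zs)
```

With only a handful of features, a feature's own statistic can dominate the fitted prior, which is one plausible way the uncorrected estimate ends up too optimistic; refitting without that feature removes this self-influence in the sketch.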
For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different dalt and over various p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3, 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3, 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise. A second simulation study was conducted with additional values of N. We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and dalt.
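The worst-case criterion described above is a minimax rule: take the largest absolute bias of each method over the simulated settings, then prefer the method whose largest value is smallest. A minimal sketch follows. The method names and bias values are placeholders invented for illustration and are not results from the paper; the keys are assumed to be (dalt, p0) pairs.

```python
def worst_case_abs_bias(bias_by_setting):
    """Maximum absolute bias over all (dalt, p0) settings of one method."""
    return max(abs(b) for b in bias_by_setting.values())

def safest_method(bias_table):
    """Method with the smallest worst-case absolute bias."""
    return min(bias_table, key=lambda m: worst_case_abs_bias(bias_table[m]))

def restrict(bias_table, keep):
    """Keep only the settings for which keep((dalt, p0)) is true."""
    return {
        method: {k: v for k, v in settings.items() if keep(k)}
        for method, settings in bias_table.items()
    }

# Placeholder biases, NOT taken from the paper. Keys are (dalt, p0).
bias_table = {
    "method_A": {(1.0, 0.5): -0.10, (1.0, 0.9): 0.02,
                 (2.0, 0.5): -0.30, (2.0, 0.9): 0.01},
    "method_B": {(1.0, 0.5): -0.15, (1.0, 0.9): 0.12,
                 (2.0, 0.5): -0.12, (2.0, 0.9): 0.10},
}

overall = safest_method(bias_table)
high_p0 = safest_method(restrict(bias_table, lambda k: k[1] >= 0.9))
```

In this toy table the preferred method changes once the comparison is restricted to high p0, which mirrors the structure of the abstract's conclusion: the safest choice depends on how much can be assumed about the proportion of unaffected features.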


Source journal

Statistical Applications in Genetics and Molecular Biology (Biochemistry & Molecular Biology; Mathematical & Computational Biology)

Self-citation rate

11.10%

Articles published

About the journal: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. The focus of the papers should be on the relevant statistical issues, but each paper should contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles will be warmly received.