Accurate and fast small p-value estimation for permutation tests in high-throughput genomic data analysis with the cross-entropy method.

IF 0.9 4区数学 Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology Pub Date : 2023-01-01 DOI:10.1515/sagmb-2021-0067

Yang Shi, Weiping Shi, Mengqiao Wang, Ji-Hyun Lee, Huining Kang, Hui Jiang

{"title":"Accurate and fast small p-value estimation for permutation tests in high-throughput genomic data analysis with the cross-entropy method.","authors":"Yang Shi, Weiping Shi, Mengqiao Wang, Ji-Hyun Lee, Huining Kang, Hui Jiang","doi":"10.1515/sagmb-2021-0067","DOIUrl":null,"url":null,"abstract":"Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in the application of permutation tests in genomic studies is that an enormous number of permutations are often needed to obtain reliable estimates of very small p-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data, and our approaches leverage a novel framework for parameterizing the permutation sample spaces of those two types of data respectively using the Bernoulli and conditional Bernoulli distributions, combined with the cross-entropy method. The performance of our proposed algorithms is demonstrated through the application to two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies and comparisons to existing methods such as crude permutations and SAMC, and the results show that our approaches can achieve orders of magnitude of computational efficiency gains in estimating small p-values. Our approaches offer promising solutions for the improvement of computational efficiencies of existing permutation test procedures and the development of new testing methods using permutations in genomic data analysis.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"22 1","pages":""},"PeriodicalIF":0.9000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Applications in Genetics and Molecular Biology","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1515/sagmb-2021-0067","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Mathematics","Score":null,"Total":0}

引用次数: 1

Abstract

Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in the application of permutation tests in genomic studies is that an enormous number of permutations are often needed to obtain reliable estimates of very small p-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data, and our approaches leverage a novel framework for parameterizing the permutation sample spaces of those two types of data respectively using the Bernoulli and conditional Bernoulli distributions, combined with the cross-entropy method. The performance of our proposed algorithms is demonstrated through the application to two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies and comparisons to existing methods such as crude permutations and SAMC, and the results show that our approaches can achieve orders of magnitude of computational efficiency gains in estimating small p-values. Our approaches offer promising solutions for the improvement of computational efficiencies of existing permutation test procedures and the development of new testing methods using permutations in genomic data analysis.

查看原文本刊更多论文

交叉熵法用于高通量基因组数据分析中排列检验的准确、快速小p值估计。

当检验统计量在零假设下的抽样分布由于样本量有限而难以分析或不可靠时，排列检验被广泛用于统计假设检验。在基因组研究中应用排列测试的一个关键挑战是，通常需要大量的排列来获得非常小的p值的可靠估计，从而导致大量的计算工作。为了解决这个问题，我们开发了一种算法，用于准确有效地估计配对和独立的两组基因组数据的排列检验中的小p值，我们的方法利用一个新的框架，分别使用伯努利分布和条件伯努利分布，结合交叉熵方法，参数化这两类数据的排列样本空间。通过应用于两个模拟数据集和两个由微阵列和RNA-Seq技术生成的真实基因表达数据集，并与现有方法(如粗排列和SAMC)进行比较，证明了我们提出的算法的性能，结果表明我们的方法在估计小p值方面可以实现数量级的计算效率提升。我们的方法为提高现有排列测试程序的计算效率和开发基因组数据分析中使用排列的新测试方法提供了有希望的解决方案。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Statistical Applications in Genetics and Molecular Biology 生物-生化与分子生物学

CiteScore

1.20

自引率

11.10%

发文量

审稿时长

6-12 weeks

期刊介绍： Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. The focus of the papers should be on the relevant statistical issues but should contain a succinct description of the relevant biological problem being considered. The range of topics is wide and will include topics such as linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and data base search strategies. Both original research and review articles will be warmly received.