大数据，小样本。

IF 1.2 4区数学 Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Biostatistics Pub Date : 2017-05-20 DOI:10.1515/ijb-2017-0012

Inna Gerlovina, Mark J van der Laan, Alan Hubbard

{"title":"大数据，小样本。","authors":"Inna Gerlovina, Mark J van der Laan, Alan Hubbard","doi":"10.1515/ijb-2017-0012","DOIUrl":null,"url":null,"abstract":"Multiple comparisons and small sample size, common characteristics of many types of \"Big Data\" including those that are produced by genomic studies, present specific challenges that affect reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially wide-spread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to \"reproducibility crisis\". We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple testing/small sample data. We point out that Edgeworth expansions, providing higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve reliability of studies relying on large numbers of comparisons with modest sample sizes.","PeriodicalId":49058,"journal":{"name":"International Journal of Biostatistics","volume":"13 1","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2017-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/ijb-2017-0012","citationCount":"4","resultStr":"{\"title\":\"Big Data, Small Sample.\",\"authors\":\"Inna Gerlovina, Mark J van der Laan, Alan Hubbard\",\"doi\":\"10.1515/ijb-2017-0012\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multiple comparisons and small sample size, common characteristics of many types of \\\"Big Data\\\" including those that are produced by genomic studies, present specific challenges that affect reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially wide-spread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to \\\"reproducibility crisis\\\". We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple testing/small sample data. We point out that Edgeworth expansions, providing higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve reliability of studies relying on large numbers of comparisons with modest sample sizes.\",\"PeriodicalId\":49058,\"journal\":{\"name\":\"International Journal of Biostatistics\",\"volume\":\"13 1\",\"pages\":\"\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2017-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1515/ijb-2017-0012\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Biostatistics\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1515/ijb-2017-0012\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Biostatistics","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1515/ijb-2017-0012","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 4

摘要

多重比较和小样本量是许多类型的“大数据”的共同特征，包括基因组研究产生的数据，这对影响推断的可靠性提出了具体的挑战。使用多个检验程序需要计算检验统计量分布的非常小的尾部概率。基于大偏差理论的结果提供了保证给定实际样本量时错误率控制所必需的正式条件，将试验次数和样本量联系起来;然而，这个条件很少得到满足。使用基于Edgeworth展开(特别依赖于Peter Hall的工作)的方法，我们探索了抽样分布偏离典型假设对实际错误率的影响。我们的调查说明了实际错误率与声明的名义水平之间的差距有多大，这表明错误率控制存在潜在的广泛问题，特别是过多的误报。这是造成“可重复性危机”的一个重要因素。我们还回顾了其他一些常用的方法(如排列和基于有限抽样不等式的方法)在多重测试/小样本数据中的应用。我们指出，Edgeworth展开式提供了抽样分布的高阶近似，为数据分析提供了一个有希望的方向，可以提高依赖于大量比较和适度样本量的研究的可靠性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Big Data, Small Sample.

Multiple comparisons and small sample size, common characteristics of many types of "Big Data" including those that are produced by genomic studies, present specific challenges that affect reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially wide-spread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to "reproducibility crisis". We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple testing/small sample data. We point out that Edgeworth expansions, providing higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve reliability of studies relying on large numbers of comparisons with modest sample sizes.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal of Biostatistics MATHEMATICAL & COMPUTATIONAL BIOLOGY-STATISTICS & PROBABILITY

CiteScore

2.10

自引率

8.30%

发文量

审稿时长

>12 weeks

期刊介绍： The International Journal of Biostatistics (IJB) seeks to publish new biostatistical models and methods, new statistical theory, as well as original applications of statistical methods, for important practical problems arising from the biological, medical, public health, and agricultural sciences with an emphasis on semiparametric methods. Given many alternatives to publish exist within biostatistics, IJB offers a place to publish for research in biostatistics focusing on modern methods, often based on machine-learning and other data-adaptive methodologies, as well as providing a unique reading experience that compels the author to be explicit about the statistical inference problem addressed by the paper. IJB is intended that the journal cover the entire range of biostatistics, from theoretical advances to relevant and sensible translations of a practical problem into a statistical framework. Electronic publication also allows for data and software code to be appended, and opens the door for reproducible research allowing readers to easily replicate analyses described in a paper. Both original research and review articles will be warmly received, as will articles applying sound statistical methods to practical problems.