Multi-scale Fisher's independence test for multivariate dependence.

IF 2.4, Zone 2 (Mathematics), JCR Q2 (Biology)

Biometrika. Pub Date: 2022-09-01. Epub Date: 2022-02-21. DOI: 10.1093/biomet/asac013

S Gorsky, L Ma


Citations: 0

Abstract



Identifying dependency in multivariate data is a common inference task that arises in numerous applications. However, existing nonparametric independence tests typically require computation that scales at least quadratically with the sample size, making it difficult to apply them in the presence of massive sample sizes. Moreover, resampling is usually necessary to evaluate the statistical significance of the resulting test statistics at finite sample sizes, further worsening the computational burden. We introduce a scalable, resampling-free approach to testing the independence between two random vectors by breaking down the task into simple univariate tests of independence on a collection of 2 × 2 contingency tables constructed through sequential coarse-to-fine discretization of the sample space, transforming the inference task into a multiple testing problem that can be completed with almost linear complexity with respect to the sample size. To address increasing dimensionality, we introduce a coarse-to-fine sequential adaptive procedure that exploits the spatial features of dependency structures. We derive a finite-sample theory that guarantees the inferential validity of our adaptive procedure at any given sample size. We show that our approach can achieve strong control of the level of the testing procedure at any sample size without resampling or asymptotic approximation and establish its large-sample consistency. We demonstrate through an extensive simulation study its substantial computational advantage in comparison to existing approaches while achieving robust statistical power under various dependency scenarios, and illustrate how its divide-and-conquer nature can be exploited to not just test independence, but to learn the nature of the underlying dependency. Finally, we demonstrate the use of our method through analysing a dataset from a flow cytometry experiment.
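The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It handles only two scalar variables, enumerates every 2 × 2 table up to a fixed depth instead of selecting tables adaptively, and uses a plain Bonferroni correction as a stand-in for the paper's multiple-testing step, so it makes no claim to reproduce the paper's guarantees. The names `fisher_exact_two_sided`, `to_unit_ranks`, `multiscale_fisher_sketch` and the parameter `max_depth` are hypothetical, and the rank transform of each margin is an assumption made here for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Unnormalised hypergeometric weight of every table sharing these margins.
    weights = {k: comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(lo, hi + 1)}
    total = sum(weights.values())
    # Add up all tables that are no more likely than the observed one.
    extreme = sum(w for w in weights.values() if w <= weights[a])
    return extreme / total


def to_unit_ranks(values):
    """Map values into (0, 1) by rank; ties are broken by input order."""
    n = len(values)
    order = sorted(range(n), key=lambda idx: values[idx])
    u = [0.0] * n
    for rank, idx in enumerate(order):
        u[idx] = (rank + 0.5) / n
    return u


def multiscale_fisher_sketch(x, y, max_depth=2):
    """Bonferroni-adjusted p-value over all 2x2 tables up to max_depth.

    Assumption: exhaustive enumeration plus Bonferroni replaces the
    adaptive coarse-to-fine procedure described in the abstract.
    """
    u, v = to_unit_ranks(x), to_unit_ranks(y)
    p_values = []
    n_tests = 0
    for i in range(max_depth + 1):        # 2**i bins along x
        for j in range(max_depth + 1):    # 2**j bins along y
            tables = {}
            for ui, vi in zip(u, v):
                cell = (int(ui * 2 ** i), int(vi * 2 ** j))
                # Lower (0) or upper (1) half of the cell along each axis.
                hx = int(ui * 2 ** (i + 1)) % 2
                hy = int(vi * 2 ** (j + 1)) % 2
                counts = tables.setdefault(cell, [0, 0, 0, 0])
                counts[2 * hx + hy] += 1
            p_values.extend(fisher_exact_two_sided(*t) for t in tables.values())
            # Count empty cells too (their p-value is 1), so the correction
            # factor does not depend on the data.
            n_tests += 2 ** (i + j)
    return min(1.0, n_tests * min(p_values))
```

For instance, `multiscale_fisher_sketch(list(range(200)), list(range(200)))` returns a value far below 0.05, because the two inputs are perfectly monotonically related. Each resolution level needs only one pass over the data to fill its tables, but the number of tables grows quickly with depth in this exhaustive version.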


Source journal: Biometrika (Biology)
CiteScore: 5.50
Self-citation rate: 3.70%
Articles published:
Review time: 6-12 weeks

About the journal: Biometrika is primarily a journal of statistics in which emphasis is placed on papers containing original theoretical contributions of direct or potential value in applications. From time to time, papers in bordering fields are also published.