Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics

Elias Chaibub Neto
arXiv - STAT - Computation. Published 2024-06-10. DOI: arxiv-2406.06488 (https://doi.org/arxiv-2406.06488)

Abstract

Non-parametric two-sample tests based on energy distance or maximum mean discrepancy are widely used for comparing multivariate data from two populations. While these tests enjoy desirable statistical properties, their test statistics can be expensive to compute, as they require the computation of three distinct Euclidean distance (or kernel) matrices between samples. The time complexity of these computations (namely, $O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically with the number of samples ($n_x$, $n_y$) and linearly with the number of variables ($p$). Since the standard permutation test requires repeated re-computation of these expensive statistics, its application to large datasets can become infeasible. While several statistical approaches have been proposed to mitigate this issue, they all sacrifice desirable statistical properties to decrease the computational cost (e.g., they trade statistical power for computation speed). A better computational strategy is to first pre-compute the Euclidean distance (kernel) matrix of the concatenated data, and then permute indices and retrieve the corresponding elements to compute the re-sampled statistics. While this strategy can reduce the computational cost relative to the standard permutation test, it relies on the computation of a larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$. In this paper, we present a novel computationally efficient permutation algorithm which only requires the pre-computation of the three smaller matrices and achieves large computational speedups without sacrificing finite-sample validity or statistical power. We illustrate its computational gains in a series of experiments and compare its statistical power to the current state-of-the-art approach for balancing computational cost and statistical performance.
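To make the pooled-matrix strategy mentioned in the abstract concrete, the following is a minimal Python sketch of a permutation test that pre-computes the Euclidean distance matrix of the concatenated data once and then only permutes indices. It illustrates the baseline strategy, not the paper's novel algorithm, whose details are not given in the abstract. The function names (`energy_stat`, `pooled_permutation_test`), the use of NumPy, and the V-statistic form of the energy distance (2·mean(d_xy) − mean(d_xx) − mean(d_yy)) are assumptions made for illustration; the paper may use a different normalization.

```python
import numpy as np


def energy_stat(D, ix, iy):
    """Energy distance (V-statistic form, an assumption of this sketch)
    read off a pre-computed pooled distance matrix D using index sets ix, iy."""
    d_xy = D[np.ix_(ix, iy)].mean()  # between-sample distances
    d_xx = D[np.ix_(ix, ix)].mean()  # within-sample distances, first group
    d_yy = D[np.ix_(iy, iy)].mean()  # within-sample distances, second group
    return 2.0 * d_xy - d_xx - d_yy


def pooled_permutation_test(x, y, n_perm=199, seed=0):
    """Permutation test that computes the pooled distance matrix only once.

    x: array of shape (n_x, p); y: array of shape (n_y, p).
    Returns the observed statistic and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    n_x, n = x.shape[0], z.shape[0]

    # One-off O((n_x + n_y)^2 p) computation of all pairwise distances.
    # (Broadcasting keeps the sketch short; it is memory-hungry for large n.)
    D = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)

    idx = np.arange(n)
    observed = energy_stat(D, idx[:n_x], idx[n_x:])

    # Each permutation only re-indexes D; no distances are recomputed.
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        if energy_stat(D, perm[:n_x], perm[n_x:]) >= observed:
            exceed += 1

    # The +1 terms count the observed labelling as one of the permutations.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value


# Usage: two bivariate samples whose means differ.
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 2))
y = rng.normal(loc=3.0, size=(30, 2))
stat, p = pooled_permutation_test(x, y)
```

In this sketch each permutation costs only the index look-ups and averages over the stored matrix, which is what makes the pooled approach cheaper than recomputing distances per permutation. The price is storing and computing the full $(n_x + n_y) \times (n_x + n_y)$ matrix, the cost the abstract identifies as the motivation for the proposed algorithm.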