Testing homogeneity in high dimensional data through random projections
Authors: Tao Qiu, Qintong Zhang, Yuanyuan Fang, Wangli Xu
Journal: Journal of Multivariate Analysis, Volume 200, Article 105252 (JCR Q2, Statistics & Probability; impact factor 1.4)
Publication date: 2023-11-27
DOI: 10.1016/j.jmva.2023.105252
URL: https://www.sciencedirect.com/science/article/pii/S0047259X23000982
Citations: 0
Abstract
Testing for homogeneity of two random vectors is a fundamental problem in statistics. In the past two decades, numerous efforts have been made to detect heterogeneity when the random vectors are multivariate or even high dimensional. Due to the “curse of dimensionality”, existing tests based on Euclidean distance may fail to capture overall homogeneity in high-dimensional settings and can capture only moment discrepancies. To address this issue, we propose a fully nonparametric test for homogeneity of two random vectors. Our method involves randomly selecting two subspaces consisting of components of the vectors, projecting each subspace onto a one-dimensional space, and constructing the test statistic from the Cramér–von Mises distance between the projections. To enhance performance, we repeat this procedure many times and combine the results into the final test statistic. Theoretically, if the number of replications tends to infinity, we can avoid the potential power loss caused by poorly chosen projection directions. Owing to U-statistic theory, the asymptotic null distribution of our proposed test is standard normal, regardless of the parent distributions of the random samples and the relationship between data dimensions and sample sizes. As a result, no resampling procedure is needed to determine critical values. The empirical size and power of the proposed test are demonstrated through numerical simulations.
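The procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' test statistic: it omits the U-statistic standardization that yields the standard normal null distribution, and it assumes a single shared random subset of components and a single random unit direction per replication. The function names `cvm_distance` and `random_projection_cvm`, and the parameters `k`, `n_rep`, and `seed`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def cvm_distance(u, v):
    """Cramér–von Mises-type distance between two 1-D samples.

    Averages the squared gap between the two empirical CDFs
    over all points of the pooled sample.
    """
    pooled = np.concatenate([u, v])
    # Empirical CDF of each sample, evaluated at every pooled point.
    F = np.searchsorted(np.sort(u), pooled, side="right") / len(u)
    G = np.searchsorted(np.sort(v), pooled, side="right") / len(v)
    return float(np.mean((F - G) ** 2))


def random_projection_cvm(x, y, k=5, n_rep=200, seed=0):
    """Average CvM distance over repeated random projections.

    x: (n, p) array, y: (m, p) array with the same p.
    k, n_rep and seed are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    p = x.shape[1]
    k = min(k, p)
    total = 0.0
    for _ in range(n_rep):
        idx = rng.choice(p, size=k, replace=False)  # random subset of components
        a = rng.standard_normal(k)
        a /= np.linalg.norm(a)                      # random unit direction
        # Project both samples onto the same line and compare them in 1-D.
        total += cvm_distance(x[:, idx] @ a, y[:, idx] @ a)
    return total / n_rep


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal((100, 50))
    y = rng.standard_normal((100, 50))        # same distribution as x
    z = rng.standard_normal((100, 50)) + 1.0  # mean-shifted alternative
    print("same distribution:", random_projection_cvm(x, y))
    print("shifted mean:     ", random_projection_cvm(x, z))
```

Averaging over many replications reflects the abstract's point that a single random direction may be uninformative, while repeated draws reduce the chance that every direction misses the difference. Turning the averaged distance into a test with known critical values would require the standardization developed in the paper, which this sketch does not attempt.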
About the journal:
Founded in 1971, the Journal of Multivariate Analysis (JMVA) is the central venue for the publication of new, relevant methodology and particularly innovative applications pertaining to the analysis and interpretation of multidimensional data.
The journal welcomes contributions to all aspects of multivariate data analysis and modeling, including cluster analysis, discriminant analysis, factor analysis, and multidimensional continuous or discrete distribution theory. Topics of current interest include, but are not limited to, inferential aspects of:
Copula modeling
Functional data analysis
Graphical modeling
High-dimensional data analysis
Image analysis
Multivariate extreme-value theory
Sparse modeling
Spatial statistics.