Maximum Projection Gini Correlation (MaGiC) for mixed categorical and numerical data
Authors: Hong Xiao, Radhakrishna Adhikari, Yixin Chen, Xin Dang
Journal of Statistical Planning and Inference, volume 239, Article 106294 (published 2025-04-24)
DOI: 10.1016/j.jspi.2025.106294
URL: https://www.sciencedirect.com/science/article/pii/S0378375825000321
Citations: 0
Abstract
We propose a projection correlation as a measure of dependence between a multivariate numerical variable and a categorical variable. The projection correlation, defined as the maximum of the Gini correlations (hence MaGiC) between the categorical variable and the univariate projections of the multivariate vector, is non-parametric; intuitively, it produces a high coefficient when the two variables are dependent and zero when they are independent. We show that MaGiC is nested: it is non-decreasing as features are added to the numerical vector, and it remains unchanged if the additional numerical features are independent of both the categorical variable and the original features. We establish $\sqrt{n}$-consistency of the sample projection correlation. A powerful $K$-sample test can be carried out via the MaGiC-based independence test. Compared with related correlation definitions for multivariate variables, MaGiC also admits a faster implementation, with computational complexity $O(mn(d + \log n))$, where $d$ is the dimension of the numerical variable, $n$ is the sample size, and $m$ is the number of projections performed, as opposed to $O(dn^2)$ for Gini correlation. We demonstrate these properties through simulation and application to real datasets.
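The definition in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not spell out: the univariate Gini correlation is taken to be $(\Delta - \sum_k p_k \Delta_k)/\Delta$, where $\Delta$ is the Gini mean difference of the projected sample and $\Delta_k$ is the same quantity within class $k$, and the maximum is approximated over $m$ random unit directions. The function names (`gini_mean_difference`, `gini_correlation_1d`, `magic_random_projections`) are hypothetical, and the paper's actual search over projections may differ.

```python
import numpy as np


def gini_mean_difference(x):
    """Unbiased estimate of E|X1 - X2| for a 1-D sample, O(n log n) via sorting."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        return 0.0
    # Sum over pairs i < j of (x_(j) - x_(i)) equals sum_i (2i - n - 1) * x_(i).
    weights = 2.0 * np.arange(1, n + 1) - n - 1.0
    return 2.0 * np.dot(weights, x) / (n * (n - 1))


def gini_correlation_1d(x, labels):
    """Assumed form of the univariate Gini correlation:
    (Delta - sum_k p_k * Delta_k) / Delta."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    total = gini_mean_difference(x)
    if total == 0.0:
        return 0.0  # constant projection carries no information
    within = 0.0
    for k in np.unique(labels):
        xk = x[labels == k]
        within += (xk.size / x.size) * gini_mean_difference(xk)
    return (total - within) / total


def magic_random_projections(X, labels, m=200, seed=0):
    """Approximate the maximum projection Gini correlation using m random
    unit directions (an assumed search strategy, not necessarily the paper's)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    directions = rng.standard_normal((m, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    projections = X @ directions.T  # n x m matrix, cost O(m n d)
    # Each column costs O(n log n) for the sorts, for a fixed number of classes.
    return max(gini_correlation_1d(projections[:, j], labels) for j in range(m))
```

Under these assumptions the cost is one matrix product plus a sort per projection, which is in line with the $O(mn(d + \log n))$ complexity quoted in the abstract when the number of classes is fixed.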
About the journal:
The Journal of Statistical Planning and Inference offers itself as a multifaceted and all-inclusive bridge between classical aspects of statistics and probability, and the emerging interdisciplinary aspects that have a potential of revolutionizing the subject. While we maintain our traditional strength in statistical inference, design, classical probability, and large sample methods, we also have a far more inclusive and broadened scope to keep up with the new problems that confront us as statisticians, mathematicians, and scientists.
We publish high quality articles in all branches of statistics, probability, discrete mathematics, machine learning, and bioinformatics. We also especially welcome well written and up to date review articles on fundamental themes of statistics, probability, machine learning, and general biostatistics. Thoughtful letters to the editors, interesting problems in need of a solution, and short notes carrying an element of elegance or beauty are equally welcome.