{"title":"Association Between Nominal Categorical Variables: New Measure Formulation Based on Metric Distances and Value Validity","authors":"Tarald O. Kvålseth","doi":"10.1007/s42519-023-00344-5","DOIUrl":null,"url":null,"abstract":"Abstract When dealing with nominal categorical data, it is often desirable to know the degree of association or dependence between the categorical variables. While there is literally no limit to the number of alternative association measures that have been proposed over the years, they all yield greatly varying, contradictory, and unreliable results due to their lack of an important property: value validity. After discussing the value-validity property, this paper introduces a new measure of association (dependence) based on the mean Euclidean distance between probability distributions, one being a distribution under independence. Both the asymmetric form, when one variable can be considered as the explanatory (independent) variable and one as the response (dependent) variable, and the symmetric form of the measure are introduced. Particular emphasis is given to the important 2 × 2 case when each variable has two categories, but the general case of any number of categories is also covered. Besides having the value-validity property, the new measure has all the prerequisites of a good association measure. Comparisons are made with the well-known Goodman–Kruskal lambda and tau measures. Statistical inference procedure for the new measure is also derived and numerical examples are provided.","PeriodicalId":45853,"journal":{"name":"Journal of Statistical Theory and Practice","volume":"76 1","pages":"0"},"PeriodicalIF":0.6000,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Statistical Theory and Practice","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s42519-023-00344-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
引用次数: 0
Abstract
Abstract When dealing with nominal categorical data, it is often desirable to know the degree of association or dependence between the categorical variables. While there is literally no limit to the number of alternative association measures that have been proposed over the years, they all yield greatly varying, contradictory, and unreliable results due to their lack of an important property: value validity. After discussing the value-validity property, this paper introduces a new measure of association (dependence) based on the mean Euclidean distance between probability distributions, one being a distribution under independence. Both the asymmetric form, when one variable can be considered as the explanatory (independent) variable and one as the response (dependent) variable, and the symmetric form of the measure are introduced. Particular emphasis is given to the important 2 × 2 case when each variable has two categories, but the general case of any number of categories is also covered. Besides having the value-validity property, the new measure has all the prerequisites of a good association measure. Comparisons are made with the well-known Goodman–Kruskal lambda and tau measures. Statistical inference procedure for the new measure is also derived and numerical examples are provided.