Title: Comparison of some correlation measures for continuous and categorical data
Authors: E. Skotarczak, A. Dobek, K. Moliński
Journal: Biometrical Letters, pp. 253-261
Publication date: 2019-12-01
Publication type: Journal Article
DOI: 10.2478/bile-2019-0015 (https://doi.org/10.2478/bile-2019-0015)
Source: Semantic Scholar
Citation count: 2
Abstract
The literature offers a wide collection of correlation and association coefficients for different data structures. Some correlation coefficients are conventionally used for continuous data and others for categorical or ordinal observations. The aim of this paper is to verify the performance of various approaches to correlation coefficient estimation for several types of observations. Both simulated and real data were analysed. For continuous variables, Pearson’s r² and MIC (the maximal information coefficient) were determined, whereas for categorized data three approaches were compared: Cramér’s V, Joe’s estimator, and the regression-based estimator. Two methods of discretization for continuous data were used. The following conclusions were drawn: the regression-based approach yielded the best results for data with the highest assumed r² coefficient, whereas Joe’s estimator was the better approximation of the true correlation when the assumed r² was small; and the MIC estimator detected the maximal level of dependency for data having a quadratic relation. Moreover, discretizing data with a non-linear dependency can cause loss of dependency information. The calculations were supported by the R packages arules and minerva.
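To make the comparison in the abstract concrete, the following is a minimal Python sketch, not the authors' code (the paper reports using the R packages arules and minerva). It computes Pearson's r² on continuous data and Cramér's V on the same data after discretization, for a linear and a quadratic relation. Everything specific here is an assumption made for illustration: the function names, the sample size, the noise levels, the choice of five bins, and the use of equal-width and equal-frequency binning (the abstract does not name the two discretization methods). MIC, Joe's estimator, and the regression-based estimator are not implemented.

```python
import math
import random
from collections import Counter


def pearson_r2(x, y):
    """Squared Pearson correlation coefficient of two numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def equal_width_bins(values, k):
    """Label each value with one of k equal-width intervals (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Label each value with one of k bins holding nearly equal counts."""
    order = sorted(range(len(values)), key=values.__getitem__)
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(values)
    return labels


def cramers_v(a, b):
    """Cramér's V computed from the chi-square statistic of the a-by-b table."""
    n = len(a)
    joint = Counter(zip(a, b))
    row, col = Counter(a), Counter(b)
    chi2 = 0.0
    for i in row:
        for j in col:
            expected = row[i] * col[j] / n
            chi2 += (joint[(i, j)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(row), len(col)) - 1)))


# Simulated data (assumed settings): one linear and one quadratic relation.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(2000)]
linear = [v + rng.gauss(0, 0.3) for v in x]
quadratic = [v * v + rng.gauss(0, 0.05) for v in x]

for name, y in (("linear", linear), ("quadratic", quadratic)):
    print(name, "Pearson r2 =", round(pearson_r2(x, y), 3))
    for binning in (equal_width_bins, equal_frequency_bins):
        v = cramers_v(binning(x, 5), binning(y, 5))
        print("   ", binning.__name__, "Cramer's V =", round(v, 3))
```

Running the script prints one r² value and two V values per relation, so the effect of the relation's shape and of the binning method can be read side by side. Because x is drawn symmetrically around zero, Pearson's r² for the quadratic case should come out close to zero even though y depends strongly on x, which is the kind of dependency the abstract says MIC is able to detect.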