Impact of Missing Data on Correlation Coefficient Values: Deletion and Imputation Methods for Data Preparation

IF 0.6 Q3 MULTIDISCIPLINARY SCIENCES

Malaysian Journal of Fundamental and Applied Sciences. Pub Date: 2023-12-04. DOI: 10.11113/mjfas.v19n6.3098

Mohamed Shantal, Z. Othman, Azuraliza Abu Bakar

Abstract

The correlation coefficient is one of the essential statistical tools for discovering relationships among variables. Depending on the data type, correlation can be quantified with Pearson's, Spearman's, or Kendall's correlation coefficient. As with any use of data, missing values reduce the amount of data available and can affect the results. Furthermore, removing records with missing values, as in complete case analysis or available case analysis, may introduce selection bias. In this paper, we investigate the impact of missing data on the correlation coefficient by calculating the difference between the correlation coefficient of the original complete dataset and that of a dataset with missing data. Two deletion strategies (Listwise and Pairwise) and three imputation strategies (Mean, k-Nearest Neighbors (k-NN), and Expectation-Maximization) were used to prepare the data before calculating the correlation coefficient. The unique correlation coefficient values were extracted into a one-dimensional array, and the root mean square error (RMSE) was used to evaluate the experiments. Eight UCI and Kaggle datasets with different sizes and numbers of attributes were used in this study. The experimental results show that the Pairwise deletion strategy and k-NN imputation give good results on the correlation coefficient when the missing rate is moderate or lower. Pairwise uses all the available values and discards only the missing values of the attributes involved, while k-NN fills the missing values with new values that yield correlation coefficients close to the actual values.
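
The evaluation described in the abstract can be illustrated with a small sketch. It is not the authors' code: the synthetic three-attribute data, the 10% missing rate, the use of Pearson's coefficient, and all function names are assumptions made for illustration, and only Listwise deletion, Pairwise deletion, and Mean imputation are shown (k-NN and Expectation-Maximization are omitted). The unique correlation values are taken here to be the upper triangle of the correlation matrix, flattened into a one-dimensional list and compared with the complete-data values by RMSE.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with no missing values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def unique_correlations(rows, pairwise):
    """Flatten the upper triangle (i < j) of the correlation matrix into a list.

    rows: list of records, with None marking a missing value.
    pairwise=False: Listwise deletion, drop every record with any missing value.
    pairwise=True: Pairwise deletion, for each attribute pair drop only the
    records that are missing a value in one of those two attributes.
    """
    if not pairwise:
        rows = [r for r in rows if None not in r]
    n_cols = len(rows[0])
    values = []
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            kept = [(r[i], r[j]) for r in rows
                    if r[i] is not None and r[j] is not None]
            values.append(pearson([a for a, _ in kept], [b for _, b in kept]))
    return values


def mean_impute(rows):
    """Replace each missing value with the mean of its attribute's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def rmse(a, b):
    """Root mean square error between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical data: 200 records, 3 attributes, the first two correlated.
rng = random.Random(0)
complete = []
for _ in range(200):
    x = rng.gauss(0, 1)
    complete.append([x, x + rng.gauss(0, 0.5), rng.gauss(0, 1)])

# Remove about 10% of the values completely at random (assumed mechanism).
with_missing = [[None if rng.random() < 0.1 else v for v in r] for r in complete]

# Reference: correlations of the original complete dataset.
reference = unique_correlations(complete, pairwise=True)

scores = {
    "listwise": rmse(reference, unique_correlations(with_missing, pairwise=False)),
    "pairwise": rmse(reference, unique_correlations(with_missing, pairwise=True)),
    "mean": rmse(reference, unique_correlations(mean_impute(with_missing), pairwise=True)),
}
for name, value in scores.items():
    print(f"{name}: RMSE = {value:.4f}")
```

A lower RMSE means the prepared data reproduce the complete-data correlations more closely. A single synthetic run like this one says nothing about how the strategies rank in general; the study draws its conclusions from eight real datasets.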

Source journal: Malaysian Journal of Fundamental and Applied Sciences (Multidisciplinary Sciences)

CiteScore: 1.40

Self-citation rate: 0.00%