Comparison of Simple Missing Data Imputation Techniques for Numerical and Categorical Datasets

International Journal of Research in Engineering and Applied Sciences, Pub Date: 2023-04-08, DOI: 10.46565/jreas.202381468-475

Ramu Gautam, Shahram Latifi


Citations: 0

Abstract

Almost every dataset has missing data. Common causes are sensor error, equipment malfunction, human error, and translation loss. We study how accurately statistical methods (mean, median, mode) and a machine-learning method (k-nearest neighbors, kNN) impute missing data in numerical datasets, where data is either missing not at random (MNAR) or missing completely at random (MCAR), and in categorical datasets. The imputed datasets are used to make predictions on a test set, and the mean squared error (MSE) of those predictions serves as the measure of imputation performance. The mean absolute difference between the original and imputed data is also recorded.

When the data is MCAR, kNN imputation yields the lowest MSE on all datasets, making it the most accurate method. When less than 20% of the data is missing, mean and median imputation are effective in regression problems. kNN imputation is better at 20% missingness and significantly better when 50% or more of the data is missing. For the kNN method, k = 5 gives better results than k = 3, while k = 10 gives results similar to k = 5.

For MNAR datasets, the statistical methods yield similar or lower MSE than kNN imputation when fewer than 25% of instances have a missing feature. At higher levels of missingness, kNN imputation is superior. Given enough data points without missing features, deleting the instances with missing data may be a better choice at lower missingness levels. For categorical data, kNN and mode imputation are both effective.
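The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the synthetic data, the function names (`mask_mcar`, `mask_mnar`, `impute_stat`, `impute_knn`, `mean_abs_diff`), the Euclidean distance, and the particular MNAR mechanism (hiding the largest values) are all assumptions made for illustration. Only the mean absolute difference is computed here; the downstream prediction MSE used in the paper is not reproduced.

```python
import math
import random
import statistics


def mask_mcar(rows, col, frac, rng):
    """Hide a uniformly random fraction of column `col` (MCAR)."""
    out = [list(r) for r in rows]
    n_missing = int(round(frac * len(out)))
    for i in rng.sample(range(len(out)), n_missing):
        out[i][col] = None
    return out


def mask_mnar(rows, col, frac):
    """Hide the largest values of column `col`.

    Assumed MNAR mechanism: missingness depends on the hidden value itself.
    """
    out = [list(r) for r in rows]
    n_missing = int(round(frac * len(out)))
    by_value = sorted(range(len(out)), key=lambda i: out[i][col], reverse=True)
    for i in by_value[:n_missing]:
        out[i][col] = None
    return out


def impute_stat(rows, col, stat=statistics.mean):
    """Fill gaps with one statistic (mean, median, or mode) of the observed values."""
    fill = stat([r[col] for r in rows if r[col] is not None])
    out = [list(r) for r in rows]
    for r in out:
        if r[col] is None:
            r[col] = fill
    return out


def impute_knn(rows, col, k=5):
    """Fill gaps with the mean of `col` over the k nearest complete rows.

    Distance is Euclidean over all columns except `col`; this sketch assumes
    that only `col` contains missing values.
    """
    donors = [r for r in rows if r[col] is not None]
    out = []
    for r in rows:
        new = list(r)
        if new[col] is None:
            def dist(d, r=r):
                return math.sqrt(sum((a - b) ** 2
                                     for j, (a, b) in enumerate(zip(r, d))
                                     if j != col))
            nearest = sorted(donors, key=dist)[:k]
            new[col] = statistics.mean([d[col] for d in nearest])
        out.append(new)
    return out


def mean_abs_diff(original, imputed, masked, col):
    """Mean absolute difference between true and imputed values at hidden cells."""
    hidden = [i for i, r in enumerate(masked) if r[col] is None]
    return statistics.mean([abs(original[i][col] - imputed[i][col]) for i in hidden])


# Synthetic data (assumption): second column is roughly twice the first.
rng = random.Random(0)
original = [[float(x), 2.0 * x + rng.gauss(0.0, 0.5)] for x in range(50)]

for name, masked in [("MCAR", mask_mcar(original, 1, 0.2, rng)),
                     ("MNAR", mask_mnar(original, 1, 0.2))]:
    for label, filled in [("mean", impute_stat(masked, 1)),
                          ("median", impute_stat(masked, 1, statistics.median)),
                          ("kNN k=5", impute_knn(masked, 1, k=5))]:
        print(name, label, round(mean_abs_diff(original, filled, masked, 1), 3))
```

Because the hidden column in this toy data is strongly correlated with the observed one, kNN has an easy job; on real datasets the gap between methods depends on how informative the remaining features are, which is why the paper evaluates several datasets and missingness levels.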

