Comparative Study of Missing Value Imputation Techniques on E-Commerce Product Ratings
Authors: Dimple Chehal, Parul Gupta, Payal Gulati, Tanisha Gupta
Journal: Informatica (JCR Q2, Computer Science, Information Systems; impact factor 3.3)
Published: 2023-06-05
DOI: https://doi.org/10.31449/inf.v47i3.4156
Citations: 0
Abstract
Missing data occur in practically all studies and add a layer of ambiguity to data interpretation. Missing values in a dataset mean a loss of important information, and they are among the most common data quality issues. Missing values are values that are not present in the dataset; they are usually written as NaNs, blanks, or other placeholders. They create imbalanced observations and biased estimates, and sometimes lead to misleading results. The majority of real-world datasets have missing values. As a result, they must be handled appropriately to deliver an efficient and valid analysis. By filling in the missing values, a complete dataset can be created and the challenge of dealing with complex patterns of missingness can be avoided. Missing values can be of both continuous and categorical types. To get more precise results, a variety of techniques for filling in missing values can be used. In the present study, nine imputation methods were compared: Simple Imputer, Last Observation Carried Forward (LOCF), KNN Imputation (KNN), Hot Deck, Linear Regression, MissForest, Random Forest Regression, DataWig, and Multivariate Imputation by Chained Equations (MICE). The comparison was performed on a real-time Amazon dataset using three evaluation criteria: R-squared (R²), mean squared error (MSE), and mean absolute error (MAE). For R-squared, which ranges from 0 to 1, KNN had the best outcomes while DataWig had the worst. In terms of MSE and MAE, the hot deck imputation approach fared best, whereas MissForest performed worst (MAE). The hot deck imputation method appears to be of interest and merits further investigation in practice.
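The kind of comparison the abstract describes can be illustrated with a minimal sketch: hide some known ratings, impute them, and score the imputed values against the hidden ones using R², MSE, and MAE. This is not the authors' pipeline. The toy ratings, the masked positions, and all function names below are assumptions made for illustration, and only two of the nine methods (mean imputation as a stand-in for a simple imputer, and LOCF) are shown.

```python
from statistics import mean


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def locf_impute(values):
    """Last observation carried forward; leading gaps stay None."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    centre = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - centre) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy ratings (illustrative values only, not the paper's Amazon data).
true_ratings = [1.0, 1.0, 2.0, 5.0, 5.0, 4.0]
hidden = [1, 4]  # positions masked to simulate missing ratings
masked = [None if i in hidden else r for i, r in enumerate(true_ratings)]


def score(imputed):
    """Score the imputed values at the hidden positions only."""
    y_true = [true_ratings[i] for i in hidden]
    y_pred = [imputed[i] for i in hidden]
    return {
        "r2": r_squared(y_true, y_pred),
        "mse": mse(y_true, y_pred),
        "mae": mae(y_true, y_pred),
    }


mean_scores = score(mean_impute(masked))
locf_scores = score(locf_impute(masked))
print("mean:", mean_scores)
print("LOCF:", locf_scores)
```

Scoring only the masked positions keeps the metrics from being inflated by values that were never missing. On a real dataset, the masked positions would normally be sampled at random and the experiment repeated over several masks.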
About the journal:
The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design including experts who implement and manage information systems applications.