Two-stage imputation method to handle missing data for categorical response variable

IF 0.6 Q4 STATISTICS & PROBABILITY

Communications for Statistical Applications and Methods Pub Date : 2023-11-30 DOI:10.29220/csam.2023.30.6.577

Jong-Min Kim, Kee-Jae Lee, Seung-Joo Lee

Citations: 0

Abstract

Conventional categorical data imputation techniques, such as mode imputation, often encounter issues related to overestimation. If the variable has too many categories, the multinomial logistic regression imputation method may be infeasible due to computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify significant variables for the target categorical variable. In the second stage, we use these important variables as predictors: logistic regression to impute missing data in binary variables, polytomous regression to impute missing data in categorical variables, and predictive mean matching to impute missing data in quantitative variables. Through analysis of both simulated and real data that are asymmetric and non-normal, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. In the analysis of real survey data, we also demonstrate that the proposed two-stage imputation method surpasses the current imputation approach in terms of accuracy.
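The two-stage idea in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation and rests on several assumptions: stage 1 imitates only the shadow-feature comparison behind Boruta, with empirical mutual information standing in for random-forest importance and a simple majority rule standing in for Boruta's statistical test; stage 2 replaces polytomous regression with a random draw from observed donors that share the selected predictor values. The function names, the toy variables `x1`, `x2`, `y`, and the 20% missingness rate are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def mutual_information(xs, ys):
    """Empirical mutual information (nats) between two categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def select_variables(rows, target, candidates, n_rounds=20, rng=None):
    """Stage 1 (Boruta-like, simplified): keep a predictor if its score beats
    the best shuffled 'shadow' predictor in more than half of the rounds."""
    rng = rng or random.Random(0)
    y = [r[target] for r in rows]
    cols = {v: [r[v] for r in rows] for v in candidates}
    real = {v: mutual_information(cols[v], y) for v in candidates}
    hits = Counter()
    for _ in range(n_rounds):
        shadow_max = 0.0
        for v in candidates:
            shadow = cols[v][:]
            rng.shuffle(shadow)  # shuffling breaks any link to the target
            shadow_max = max(shadow_max, mutual_information(shadow, y))
        for v in candidates:
            if real[v] > shadow_max:
                hits[v] += 1
    return [v for v in candidates if hits[v] > n_rounds / 2]


def impute_target(rows, target, selected, rng=None):
    """Stage 2 (stand-in for polytomous regression): fill each missing target
    with a random draw from observed rows sharing the selected predictors."""
    rng = rng or random.Random(0)
    complete = [r for r in rows if r[target] is not None]
    donors = defaultdict(list)
    for r in complete:
        donors[tuple(r[v] for v in selected)].append(r[target])
    fallback = [r[target] for r in complete]  # used when a cell has no donors
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        if r[target] is None:
            key = tuple(r[v] for v in selected)
            r[target] = rng.choice(donors.get(key, fallback))
        out.append(r)
    return out


# Toy data (hypothetical): x1 drives y about 90% of the time, x2 is noise.
data_rng = random.Random(1)
label_of = {"a": "low", "b": "mid", "c": "high"}
rows = []
for _ in range(300):
    x1, x2 = data_rng.choice("abc"), data_rng.choice("uvw")
    if data_rng.random() < 0.9:
        y = label_of[x1]
    else:
        y = data_rng.choice(["low", "mid", "high"])
    rows.append({"x1": x1, "x2": x2, "y": y})
truth = [r["y"] for r in rows]

# Remove roughly 20% of the target values completely at random.
mask_rng = random.Random(2)
for r in rows:
    if mask_rng.random() < 0.2:
        r["y"] = None

complete_cases = [r for r in rows if r["y"] is not None]
selected = select_variables(complete_cases, "y", ["x1", "x2"])
imputed = impute_target(rows, "y", selected)

missing_idx = [i for i, r in enumerate(rows) if r["y"] is None]
accuracy = sum(imputed[i]["y"] == truth[i] for i in missing_idx) / len(missing_idx)
print("selected:", selected, "accuracy on imputed cells:", round(accuracy, 3))
```

Drawing from donors rather than always taking the most frequent donor value is a deliberate choice in this sketch: it keeps the imputed category proportions close to the observed ones, whereas mode imputation inflates the share of the modal category.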

Source journal

Communications for Statistical Applications and Methods (Statistics & Probability)

CiteScore

0.90

Self-citation rate

0.00%

Journal description: Communications for Statistical Applications and Methods (Commun. Stat. Appl. Methods, CSAM) is an official journal of the Korean Statistical Society and Korean International Statistical Society. It is an international, Open Access journal dedicated to publishing peer-reviewed, high-quality, and innovative statistical research. CSAM publishes articles on applied and methodological research in the areas of statistics and probability. It features rapid publication and broad coverage of statistical applications and methods. It welcomes papers on novel applications of statistical methodology in areas including medicine (pharmaceutical, biotechnology, medical device), business, management, economics, ecology, education, computing, engineering, operational research, biology, sociology, and earth science; papers from other areas are also considered.