Two-stage imputation method to handle missing data for categorical response variable

Impact factor 0.5, JCR Q4 (Statistics & Probability)
Jong-Min Kim, Kee-Jae Lee, Seung-Joo Lee
Journal: Communications for Statistical Applications and Methods
Publication date: 2023-11-30
DOI: 10.29220/csam.2023.30.6.577 (https://doi.org/10.29220/csam.2023.30.6.577)
Citations: 0

Abstract

Conventional imputation techniques for categorical data, such as mode imputation, often suffer from overestimation. If the variable has too many categories, multinomial logistic regression imputation may be infeasible because of computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify the variables that are important for the target categorical variable. In the second stage, we use these important variables as predictors: logistic regression imputes missing data in binary variables, polytomous regression imputes missing data in categorical variables, and predictive mean matching imputes missing data in quantitative variables. Through analysis of asymmetric and non-normal data, both simulated and real, we demonstrate that the two-stage imputation method outperforms imputation methods that lack variable selection, as measured by accuracy. In an analysis of real survey data, we also demonstrate that the proposed two-stage method is more accurate than the current imputation approach.
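The second stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `choose_method` and `pmm_impute` are hypothetical, the sketch assumes categorical values are stored as strings and quantitative values as numbers, and it assumes the first (Boruta) stage has already reduced the predictors to a single important variable `x`. The predictive mean matching here is deliberately simplified: it uses one ordinary least squares fit, whereas multiple-imputation software typically also draws the regression parameters at random.

```python
import random


def choose_method(column):
    """Pick a second-stage method from a column's observed values.

    Assumption of this sketch: categorical values are strings,
    quantitative values are numbers, and None marks a missing entry.
    """
    observed = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        return "predictive mean matching"
    levels = set(observed)
    if len(levels) == 2:
        return "logistic regression"
    return "polytomous regression"


def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y by simplified predictive mean matching.

    x is a fully observed numeric predictor (assumed to come from the
    variable selection stage); y is the quantitative variable to impute.
    """
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]

    # Ordinary least squares fit of y on x, using observed cases only.
    mean_x = sum(x[i] for i in obs) / len(obs)
    mean_y = sum(y[i] for i in obs) / len(obs)
    sxx = sum((x[i] - mean_x) ** 2 for i in obs)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Predicted mean for every case, observed or missing.
    pred = [intercept + slope * xi for xi in x]

    filled = list(y)
    for j, value in enumerate(y):
        if value is None:
            # Donors: the k observed cases with the closest predicted means.
            donors = sorted(obs, key=lambda i: abs(pred[i] - pred[j]))[:k]
            # Borrow the actually observed value of one random donor.
            filled[j] = y[rng.choice(donors)]
    return filled


if __name__ == "__main__":
    x = [1, 2, 2.9, 4, 5, 6]
    y = [2.1, 3.9, None, 8.2, 9.8, None]
    print(choose_method(y))
    print(pmm_impute(x, y, k=2))
```

Because every imputed value is copied from an observed donor, predictive mean matching never produces values outside the observed range, which is one reason it suits asymmetric and non-normal quantitative variables.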
Source journal
CiteScore: 0.90
Self-citation rate: 0.00%
Articles published: 49
About the journal: Communications for Statistical Applications and Methods (Commun. Stat. Appl. Methods, CSAM) is an official journal of the Korean Statistical Society and Korean International Statistical Society. It is an international, open access journal dedicated to publishing peer-reviewed, high-quality, innovative statistical research. CSAM publishes articles on applied and methodological research in the areas of statistics and probability. It features rapid publication and broad coverage of statistical applications and methods. It welcomes papers on novel applications of statistical methodology in areas including medicine (pharmaceutical, biotechnology, medical device), business, management, economics, ecology, education, computing, engineering, operational research, biology, sociology, and earth science; papers from other areas are also considered.