Title: Toward accurate credit evaluation: an efficient imputation approach for financial data
Authors: Jie Lu, Shengda Zhuo, Jinjie Qiu, Yin Tang
Journal: Data Science and Management, vol. 8, no. 3, pp. 374-387
Publication date: 2025-09-01
Publication type: Journal Article
DOI: 10.1016/j.dsm.2025.06.001
URL: https://www.sciencedirect.com/science/article/pii/S2666764925000281
Citation count: 0
Abstract
Missing values and mixed data types, including discrete and ordered (e.g., continuous and ordinal) variables, are widespread in financial datasets. Estimating the missing values is crucial because many data analysis pipelines require complete data, and the task is particularly challenging when the data are of mixed type. However, existing methods treat discrete and ordinal data as continuous values, which may reduce their efficacy. To fill this gap, this study proposes a probabilistic imputation method for mixed-type and incomplete loan data (PMILD), based on a mixed Gaussian copula model that supports both single and multiple imputation. The method models mixed discrete and ordinal data with latent Gaussian distributions: observed features with arbitrary margins are mapped to a latent normal space, and feature correlations are approximated in that space through an expectation-maximization (EM) procedure. Empirical results on nine real-world datasets show that PMILD substantially outperforms state-of-the-art imputation methods on mixed-type, incomplete loan data. This improvement enhances both operational efficiency and credit evaluation accuracy in finance-related applications.
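
The latent-space idea in the abstract can be illustrated with a minimal sketch. This is not the PMILD implementation. It assumes continuous columns only (ordinal and discrete columns would additionally need truncated-normal moments), uses an empirical-CDF transform for the margins, and performs single imputation by the conditional mean. All function names (`to_latent`, `from_latent`, `fit_latent_correlation`, `impute`) and the toy data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def to_latent(col):
    """Map observed entries of one column to normal scores; NaNs stay NaN."""
    z = np.full(col.shape, np.nan)
    obs = ~np.isnan(col)
    vals = col[obs]
    # Empirical CDF scaled by n + 1 so every probability lies strictly in (0, 1).
    u = np.array([(vals <= v).sum() for v in vals]) / (vals.size + 1.0)
    z[obs] = [STD_NORMAL.inv_cdf(float(p)) for p in u]
    return z


def from_latent(z_val, col):
    """Map a latent value back to the observed scale via the empirical quantile."""
    return float(np.quantile(col[~np.isnan(col)], STD_NORMAL.cdf(float(z_val))))


def fit_latent_correlation(Z, n_iter=50):
    """EM estimate of the latent correlation matrix from rows with missing entries."""
    n, p = Z.shape
    sigma = np.eye(p)
    for _ in range(n_iter):
        acc = np.zeros((p, p))
        for row in Z:
            m = np.isnan(row)
            o = ~m
            z_hat = row.copy()
            cov = np.zeros((p, p))
            if m.any():
                if o.any():
                    # E-step: conditional mean and covariance of missing given observed.
                    w = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)])
                    z_hat[m] = w.T @ row[o]
                    cov[np.ix_(m, m)] = sigma[np.ix_(m, m)] - sigma[np.ix_(m, o)] @ w
                else:
                    # Nothing observed in this row: fall back to the prior N(0, sigma).
                    z_hat[:] = 0.0
                    cov = sigma.copy()
            acc += np.outer(z_hat, z_hat) + cov
        # M-step: average second moment, rescaled to a correlation matrix.
        sigma = acc / n
        d = np.sqrt(np.diag(sigma))
        sigma = sigma / np.outer(d, d)
    return sigma


def impute(X, n_iter=50):
    """Single imputation: conditional mean in latent space, mapped back per column."""
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([to_latent(X[:, j]) for j in range(X.shape[1])])
    sigma = fit_latent_correlation(Z, n_iter)
    X_imp = X.copy()
    for i, row in enumerate(Z):
        m = np.isnan(row)
        o = ~m
        if not m.any():
            continue
        if o.any():
            w = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)])
            z_m = w.T @ row[o]
        else:
            z_m = np.zeros(m.sum())  # latent mean, i.e. the column median
        for j, zj in zip(np.flatnonzero(m), z_m):
            X_imp[i, j] = from_latent(zj, X[:, j])
    return X_imp, sigma


# Toy data (made up for illustration): a normal column and a skewed column
# that depends on it, with the first 20 entries of the second column removed.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = np.exp(a + 0.1 * rng.normal(size=200))
X = np.column_stack([a, b])
X[:20, 1] = np.nan
X_imp, sigma = impute(X)
```

Because each column is pushed through its own empirical CDF, the skewed column needs no distributional assumption on its margin; only the dependence between columns is modeled as Gaussian. Drawing from the conditional distribution instead of taking its mean would give multiple imputations, which the abstract says PMILD also supports.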