Title: An Improved Ensemble Method With Data Resampling for Credit Risk Prediction
Authors: Idowu Aruleba; Yanxia Sun
Journal: IEEE Access, vol. 13, pp. 71275-71287
Publication date: 2025-04-22
Publication type: Journal Article
DOI: 10.1109/ACCESS.2025.3563432
URL: https://ieeexplore.ieee.org/document/10973108/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10973108
Impact factor: 3.4
JCR: Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Region: 3 (Computer Science)
Citations: 0
Abstract
The increasing complexity and dynamic nature of financial data present significant challenges in accurately predicting credit risk, a critical task in the banking and finance sector. The application of machine learning (ML) to credit risk prediction has been hindered by the imbalanced nature of credit datasets. This study proposes an improved approach for predicting credit risk using a stacked ensemble method combined with a hybrid data resampling technique. The ensemble comprises random forest, logistic regression, and a convolutional neural network (CNN) as base learners, with a multilayer perceptron (MLP) serving as the meta-learner. To address the data imbalance, the Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbors (SMOTE-ENN) was applied. The proposed approach is benchmarked against other well-performing classifiers, including random forest, logistic regression, MLP, and CNN. The integration of hybrid data resampling with a robust stacking ensemble significantly enhanced credit risk prediction, with the proposed approach achieving sensitivity and specificity of 0.921 and 0.946 on the Australian dataset and 0.928 and 0.891 on the German dataset. On the Credit Risk Classification dataset, the stacked classifier achieved an accuracy of 0.7644 with a sensitivity of 0.000 and a specificity of 1.000 before data resampling; after data resampling, its accuracy, sensitivity, and specificity were 0.8056, 0.7989, and 0.8125, respectively. On the credit risk analysis for the extended banking loans dataset, the accuracy, sensitivity, and specificity of the stacked classifier were 0.8429, 0.6316, and 0.9216 before data resampling and 0.9632, 1.0000, and 0.9242 after data resampling. This shows that, after data resampling, the stacked classifier trained on the credit risk analysis for the extended banking loans dataset outperformed the other models.
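The sensitivity and specificity figures quoted in the abstract can be read with a small worked example. The sketch below is not taken from the paper: the function name `binary_metrics`, the 0/1 label encoding (1 as the positive class), and the toy labels are all assumptions made for illustration. It shows why a sensitivity of 0.000 together with a specificity of 1.000 corresponds to a classifier that assigns every instance to the negative class, in which case accuracy simply equals the share of negative instances in the evaluation set.

```python
from typing import Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Sensitivity: fraction of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of actual negatives that were predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity


# Hypothetical toy labels (NOT the paper's data): 3 positives and 9 negatives.
y_true = [1] * 3 + [0] * 9
# A degenerate classifier that predicts the negative class for every instance.
y_pred = [0] * 12

acc, sens, spec = binary_metrics(y_true, y_pred)
print(acc, sens, spec)  # 0.75 0.0 1.0
```

In this toy case the accuracy of 0.75 is just the proportion of negatives (9 of 12), which is why accuracy alone can look acceptable on an imbalanced dataset while no positive case is detected.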
IEEE Access | COMPUTER SCIENCE, INFORMATION SYSTEMS | ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Articles published: 6673
Review time: 6 weeks
Journal description:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.