An Improved Ensemble Method With Data Resampling for Credit Risk Prediction

IF 3.4, Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Access, Pub Date: 2025-04-22, DOI: 10.1109/ACCESS.2025.3563432

Idowu Aruleba;Yanxia Sun

IEEE Access, vol. 13, pp. 71275-71287, published 2025-04-22. Full text: https://ieeexplore.ieee.org/document/10973108/

Citations: 0

Abstract

The increasing complexity and dynamic nature of financial data present significant challenges in accurately predicting credit risk, a critical task in the banking and finance sector. The application of machine learning (ML) in credit risk prediction has been hindered by the imbalanced nature of credit datasets. This study proposes an improved approach for predicting credit risk using a stacked ensemble method combined with a hybrid data resampling technique. The ensemble comprises random forest, logistic regression, and a convolutional neural network (CNN) as base learners, with a multilayer perceptron (MLP) serving as the meta-learner. To address the data imbalance, the hybrid Synthetic Minority Over-sampling Technique and Edited Nearest Neighbors (SMOTE-ENN) technique was applied. The proposed approach is benchmarked against other well-performing classifiers, including random forest, logistic regression, MLP, and CNN.

The integration of hybrid data resampling with a robust stacking ensemble significantly enhanced credit risk prediction, with the proposed approach achieving sensitivity and specificity of 0.921 and 0.946 for the Australian dataset and 0.928 and 0.891 for the German dataset. For the Credit Risk Classification dataset, the stacked classifier achieved a sensitivity of 0.000 and a specificity of 1.000 before data resampling, with an accuracy of 0.7644; after data resampling, the accuracy, sensitivity, and specificity were 0.8056, 0.7989, and 0.8125, respectively. For the credit risk analysis for the extended banking loans dataset, the accuracy, sensitivity, and specificity of the stacked classifier before data resampling were 0.8429, 0.6316, and 0.9216, respectively; after data resampling, they were 0.9632, 1.0000, and 0.9242. This shows that, after data resampling, the stacked classifier trained on the credit risk analysis for the extended banking loans dataset outperformed the other models.
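The pre-resampling result on the Credit Risk Classification dataset (sensitivity 0.000, specificity 1.000, accuracy 0.7644) is the pattern a classifier produces when it assigns every sample to the majority class. The sketch below illustrates this with the standard definitions of the three metrics. The label counts (2356 positives, 7644 negatives) are hypothetical, chosen only so that the majority share equals the reported accuracy; they are not the dataset's actual sizes. Treating the minority class as the "positive" class is also an assumption.

```python
from typing import NamedTuple, Sequence


class Confusion(NamedTuple):
    tp: int  # positives predicted as positive
    fn: int  # positives predicted as negative
    tn: int  # negatives predicted as negative
    fp: int  # negatives predicted as positive


def confusion(y_true: Sequence[int], y_pred: Sequence[int],
              positive: int = 1) -> Confusion:
    """Count the four confusion-matrix cells for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return Confusion(tp, fn, tn, fp)


def sensitivity(c: Confusion) -> float:
    """True positive rate: TP / (TP + FN)."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def specificity(c: Confusion) -> float:
    """True negative rate: TN / (TN + FP)."""
    negatives = c.tn + c.fp
    return c.tn / negatives if negatives else 0.0


def accuracy(c: Confusion) -> float:
    """Share of all samples classified correctly."""
    total = c.tp + c.fn + c.tn + c.fp
    return (c.tp + c.tn) / total if total else 0.0


# Hypothetical imbalanced labels: 1 = minority (positive), 0 = majority.
y_true = [1] * 2356 + [0] * 7644
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

c = confusion(y_true, y_pred)
print(f"accuracy={accuracy(c):.4f} "
      f"sensitivity={sensitivity(c):.4f} "
      f"specificity={specificity(c):.4f}")
```

Under these assumptions, accuracy simply equals the majority-class share, which is why the abstract reports sensitivity and specificity alongside accuracy when comparing models before and after resampling.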


Source journal

IEEE Access (COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 9.80

Self-citation rate: 7.70%

Articles published: 6673

Review time: 6 weeks

About the journal: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.