Comparison of the impact of dimensionality reduction and data splitting on classification performance in credit risk assessment

IF 10.7 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Cem Bulut, Emel Arslan
{"title":"比较降维和数据拆分对信用风险评估分类性能的影响","authors":"Cem Bulut,&nbsp;Emel Arslan","doi":"10.1007/s10462-024-10904-1","DOIUrl":null,"url":null,"abstract":"<div><p>Credit risk assessment (CRA) plays an important role in credit decision-making process of financial institutions. Today, developing big data analysis and machine learning methods have marked a new era in credit risk estimation. In recent years, using machine learning methods in credit risk estimation has emerged as an alternative method for financial institutions. The past demographic and financial data of the person whose CRA will be performed is important for creating an automatic artificial intelligence credit score prediction model based on machine learning. It is also necessary to use features correctly to create accurate machine learning models. This article aims to investigate the effects of dimensionality reduction and data splitting steps on the performance of classification algorithms widely used in the literature. In our study, dimensionality reduction was performed using Principal Component Analysis (PCA). Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), Naive Bayes (NB) algorithms were chosen for classification. Percentage splitting (PER, 66–34%) and k-fold (k = 10) cross-validation techniques were used when dividing the data set into training and test data. The results obtained were evaluated with accuracy, recall, F1 score, precision, and AUC metrics. German data set was used in this study. The effect of data splitting and dimension reduction on the classification of CRA systems was examined. The highest ACC in PER and CV data splitting was obtained with the RF algorithm. Using data splitting methods and PCA, the highest accuracy was observed with RF and the highest AUC with NB, with 13 PCs in which 80% of the variance was obtained. As a result, the data set consisting of a total of 20 features, expressed by 13 PCs, achieved similar or higher success than the results obtained from the original data set.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":null,"pages":null},"PeriodicalIF":10.7000,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-024-10904-1.pdf","citationCount":"0","resultStr":"{\"title\":\"Comparison of the impact of dimensionality reduction and data splitting on classification performance in credit risk assessment\",\"authors\":\"Cem Bulut,&nbsp;Emel Arslan\",\"doi\":\"10.1007/s10462-024-10904-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Credit risk assessment (CRA) plays an important role in credit decision-making process of financial institutions. Today, developing big data analysis and machine learning methods have marked a new era in credit risk estimation. In recent years, using machine learning methods in credit risk estimation has emerged as an alternative method for financial institutions. The past demographic and financial data of the person whose CRA will be performed is important for creating an automatic artificial intelligence credit score prediction model based on machine learning. It is also necessary to use features correctly to create accurate machine learning models. This article aims to investigate the effects of dimensionality reduction and data splitting steps on the performance of classification algorithms widely used in the literature. 
In our study, dimensionality reduction was performed using Principal Component Analysis (PCA). Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), Naive Bayes (NB) algorithms were chosen for classification. Percentage splitting (PER, 66–34%) and k-fold (k = 10) cross-validation techniques were used when dividing the data set into training and test data. The results obtained were evaluated with accuracy, recall, F1 score, precision, and AUC metrics. German data set was used in this study. The effect of data splitting and dimension reduction on the classification of CRA systems was examined. The highest ACC in PER and CV data splitting was obtained with the RF algorithm. Using data splitting methods and PCA, the highest accuracy was observed with RF and the highest AUC with NB, with 13 PCs in which 80% of the variance was obtained. As a result, the data set consisting of a total of 20 features, expressed by 13 PCs, achieved similar or higher success than the results obtained from the original data set.</p></div>\",\"PeriodicalId\":8449,\"journal\":{\"name\":\"Artificial Intelligence Review\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":10.7000,\"publicationDate\":\"2024-08-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1007/s10462-024-10904-1.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Artificial Intelligence Review\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10462-024-10904-1\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-024-10904-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


Credit risk assessment (CRA) plays an important role in the credit decision-making process of financial institutions. Today, the development of big data analysis and machine learning methods has marked a new era in credit risk estimation, and in recent years machine learning has emerged as an alternative approach for financial institutions. The past demographic and financial data of the person whose credit risk is being assessed are important for building an automatic, machine-learning-based credit score prediction model, and features must be used correctly to obtain accurate models. This article investigates the effects of the dimensionality reduction and data splitting steps on the performance of classification algorithms widely used in the literature. In our study, dimensionality reduction was performed with Principal Component Analysis (PCA). Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and Naive Bayes (NB) were chosen as classifiers. Percentage splitting (PER, 66–34%) and k-fold (k = 10) cross-validation were used to divide the data set into training and test data, and the results were evaluated with accuracy, recall, F1 score, precision, and AUC metrics. The German Credit data set was used, and the effect of data splitting and dimensionality reduction on the classification performance of CRA systems was examined. The highest accuracy (ACC) under both PER and CV data splitting was obtained with the RF algorithm. When the data splitting methods were combined with PCA, retaining the 13 principal components (PCs) that explain 80% of the variance, RF gave the highest accuracy and NB the highest AUC. As a result, the data set of 20 original features, expressed by 13 PCs, achieved similar or higher success than the results obtained on the original data set.
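
The pipeline described in the abstract can be illustrated with a short, self-contained sketch. This is a minimal illustration, not the authors' code: it assumes the German Credit data set as published on OpenML ("credit-g"), one-hot encoding plus standardization for the mixed feature types, default classifier hyperparameters, and an arbitrary random seed, none of which are specified in the abstract. PCA is configured to keep enough components to explain 80% of the variance (13 PCs in the paper's setting), and each classifier is evaluated under both the 66–34% percentage split and 10-fold cross-validation.

```python
# Minimal sketch of the CRA pipeline from the abstract (assumptions noted above).
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# German Credit data: 20 features (mixed categorical/numeric), binary good/bad label.
X, y = fetch_openml("credit-g", version=1, as_frame=True, return_X_y=True)
y = (y == "good").astype(int)

cat_cols = X.select_dtypes(include="category").columns
num_cols = X.select_dtypes(exclude="category").columns
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
     ("num", StandardScaler(), num_cols)],
    sparse_threshold=0.0,  # dense output, required by PCA
)

classifiers = {
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
}

for name, clf in classifiers.items():
    # PCA keeps enough components to explain 80% of the variance
    # (13 PCs in the paper's setting).
    pipe = Pipeline([("encode", encode),
                     ("pca", PCA(n_components=0.80)),
                     ("clf", clf)])

    # Percentage split (PER): 66% training, 34% test.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.34, stratify=y, random_state=0)
    pipe.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, pipe.predict(X_te))
    auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])

    # 10-fold cross-validation (CV) on the full data set.
    cv_acc = cross_val_score(pipe, X, y, cv=10, scoring="accuracy").mean()

    print(f"{name}: PER acc={acc:.3f} AUC={auc:.3f} | 10-fold CV acc={cv_acc:.3f}")
```

Fitting PCA inside the pipeline ensures the projection is learned on the training folds only; the exact encoding of categorical features and any class-imbalance handling used in the paper may differ from this sketch.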

Source journal
Artificial Intelligence Review (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 22.00
Self-citation rate: 3.30%
Articles published per year: 194
Review turnaround: 5.3 months
Journal description: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.