Data Quality Improvement for Financial Distress Prediction: Feature Selection, Data Re-Sampling, and Their Combinations in Different Orders

IF 2.7 | Zone 3, Economics | Q1 ECONOMICS

Journal of Forecasting Pub Date : 2025-06-18 DOI:10.1002/for.70002

Chih-Fong Tsai, Wei-Chao Lin, Yi-Hsien Chen

Volume 44, Issue 7, pages 2205-2229. Publication type: Journal Article.

Citations: 0

Abstract

In financial distress prediction (FDP), ensuring data quality is essential for developing effective prediction models. Related studies often apply feature selection to filter out unrepresentative features from a set of financial ratios, or data re-sampling to re-balance class-imbalanced FDP training sets. Although both types of data pre-processing methods have demonstrated their effectiveness, they have not often been applied together to develop FDP models. Moreover, the performance of various feature selection algorithms (which can be divided into filter, wrapper, and embedded methods) and data re-sampling algorithms (which can be divided into under-sampling, over-sampling, and hybrid sampling methods) has not been fully investigated in FDP. Therefore, this study compares several feature selection and data re-sampling methods, employed alone and in combination in different orders. The experimental results based on nine FDP datasets show that executing data re-sampling alone always outperforms executing feature selection alone when developing FDP models, with hybrid sampling being the better choice. In most cases, better prediction performance is obtained by performing feature selection first and data re-sampling second. The best combination uses the decision tree method for feature selection and Synthetic Minority Over-sampling Technique-Edited Nearest Neighbors (SMOTE-ENN) for hybrid sampling. This combination allows the random forest classifier to produce the highest prediction accuracy. On the other hand, for the Type I error, where crisis cases are misclassified into the non-crisis class, the lowest error rate is produced by executing under-sampling alone using the ClusterCentroids algorithm combined with the random forest classifier.
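The abstract reports two different winners, one for prediction accuracy and one for Type I error. A small sketch may help show why the two measures can diverge on class-imbalanced data. It is an illustration only, not the authors' evaluation code: the label encoding (`1` for crisis, `0` for non-crisis), the function names, and the toy labels are all assumptions, and the Type I error rate is taken here as the conventional ratio of misclassified crisis cases to all true crisis cases.

```python
from typing import Sequence

CRISIS = 1      # assumed label encoding for crisis (distressed) cases
NON_CRISIS = 0  # assumed label encoding for non-crisis cases


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all cases whose predicted class matches the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one case is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def type_i_error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of true crisis cases that were predicted as non-crisis."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions made for cases that are truly crisis cases.
    crisis_preds = [p for t, p in zip(y_true, y_pred) if t == CRISIS]
    if not crisis_preds:
        raise ValueError("no crisis cases in y_true")
    missed = sum(p == NON_CRISIS for p in crisis_preds)
    return missed / len(crisis_preds)


# Hypothetical imbalanced sample: 2 crisis cases among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))           # 0.9
print(type_i_error_rate(y_true, y_pred))  # 0.5
```

In this toy sample a single missed crisis case leaves accuracy at 0.9 while the Type I error rate is 0.5, which is consistent with the abstract reporting the two measures separately.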



Source journal

Journal of Forecasting Multiple-

CiteScore

5.40

Self-citation rate

5.90%

Articles published

Journal description: The Journal of Forecasting is an international journal that publishes refereed papers on forecasting. It is multidisciplinary, welcoming papers dealing with any aspect of forecasting: theoretical, practical, computational and methodological. A broad interpretation of the topic is taken with approaches from various subject areas, such as statistics, economics, psychology, systems engineering and social sciences, all encouraged. Furthermore, the Journal welcomes a wide diversity of applications in such fields as business, government, technology and the environment. Of particular interest are papers dealing with modelling issues and the relationship of forecasting systems to decision-making processes.