Data Quality Improvement for Financial Distress Prediction: Feature Selection, Data Re-Sampling, and Their Combinations in Different Orders
Chih-Fong Tsai, Wei-Chao Lin, Yi-Hsien Chen
Journal of Forecasting, vol. 44, no. 7, pp. 2205-2229. Published 2025-06-18. DOI: 10.1002/for.70002
https://onlinelibrary.wiley.com/doi/10.1002/for.70002
In financial distress prediction (FDP), the quality of the training data is critical to developing effective prediction models. Related studies often apply feature selection to filter out unrepresentative features from a set of financial ratios, or data re-sampling to re-balance class-imbalanced FDP training sets. Although both types of data pre-processing have demonstrated their effectiveness, they have seldom been applied together to develop FDP models. Moreover, the performance of various feature selection algorithms (filter, wrapper, and embedded methods) and data re-sampling algorithms (under-sampling, over-sampling, and hybrid sampling methods) has not been fully investigated in FDP. Therefore, this study compares several feature selection and data re-sampling methods, employed alone and in combination in different orders. The experimental results, based on nine FDP datasets, show that data re-sampling alone always outperforms feature selection alone for developing FDP models, with hybrid sampling being the better choice. In most cases, better prediction performance is obtained by performing feature selection first and data re-sampling second. The best combination uses the decision tree method for feature selection and Synthetic Minority Over-sampling Technique-Edited Nearest Neighbors (SMOTE-ENN) for hybrid sampling; it allows the random forest classifier to produce the highest prediction accuracy. For the Type I error, however, where crisis cases are misclassified into the non-crisis class, the lowest error rate is produced by under-sampling alone, using the ClusterCentroids algorithm combined with the random forest classifier.
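The abstract reports two different winners depending on the metric: prediction accuracy and Type I error. The sketch below is not taken from the paper; it is a minimal, standard-library illustration of how the two metrics can disagree on a class-imbalanced set. The function name `accuracy_and_type1_error`, the label coding (1 for crisis, 0 for non-crisis), and the toy labels are all assumptions made for illustration. Type I error follows the abstract's definition (crisis cases classified as non-crisis); other texts may assign the Type I and Type II names the other way round.

```python
def accuracy_and_type1_error(y_true, y_pred, crisis_label=1):
    """Return (accuracy, Type I error rate) for binary labels.

    Type I error follows the abstract's definition: the share of actual
    crisis cases that the model assigns to the non-crisis class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    crisis_total = sum(t == crisis_label for t in y_true)
    crisis_missed = sum(
        t == crisis_label and p != crisis_label for t, p in zip(y_true, y_pred)
    )
    accuracy = correct / len(y_true)
    # Undefined when the sample contains no crisis cases at all.
    type1 = crisis_missed / crisis_total if crisis_total else float("nan")
    return accuracy, type1


# Hypothetical, hand-made labels: 18 non-crisis firms (0) and 2 crisis firms (1).
y_true = [0] * 18 + [1] * 2
# A model that predicts "non-crisis" for every firm.
y_majority = [0] * 20
# A model that catches both crisis firms but raises three false alarms.
y_balanced = [1] * 3 + [0] * 15 + [1] * 2

acc_majority, t1_majority = accuracy_and_type1_error(y_true, y_majority)
acc_balanced, t1_balanced = accuracy_and_type1_error(y_true, y_balanced)
print(acc_majority, t1_majority)
print(acc_balanced, t1_balanced)
```

On these toy labels the always-non-crisis model scores higher on accuracy yet misses every crisis case, while the second model gives up some accuracy to miss none. This is one reason a pre-processing choice that maximizes accuracy need not be the one that minimizes Type I error.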
About the journal:
The Journal of Forecasting is an international journal that publishes refereed papers on forecasting. It is multidisciplinary, welcoming papers dealing with any aspect of forecasting: theoretical, practical, computational and methodological. A broad interpretation of the topic is taken with approaches from various subject areas, such as statistics, economics, psychology, systems engineering and social sciences, all encouraged. Furthermore, the Journal welcomes a wide diversity of applications in such fields as business, government, technology and the environment. Of particular interest are papers dealing with modelling issues and the relationship of forecasting systems to decision-making processes.