{"title":"基于每个特征值区间的类实例密度的不平衡数据分类重采样方法","authors":"Fei Wang , Ming Zheng , Kai Ma , Xiaowen Hu","doi":"10.1016/j.ins.2024.121570","DOIUrl":null,"url":null,"abstract":"<div><div>In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. However, most conventional resampling approaches fall short in adequately addressing the varying contributions of individual features to the classification model. In response to this defect, this study introduces three novel resampling approaches. The first approach, Oversampling based on class instance density per feature value intervals (OCF), focuses on augmenting the dataset. The second approach, Undersampling based on class instance density per feature value intervals (UCF), seeks to reduce dataset size. The third approach, Hybrid sampling based on class instance density per feature value intervals (HSCF), which can perform oversampling and undersampling simultaneously. These approaches categorize feature value into different intervals based on their varying information content, calculate class instance densities within these intervals, and generate feature values in intervals with high discriminative information. Subsequently, these generated features are combined to synthesize minority class data, effectively achieving oversampling. Additionally, the study combines class instance density and feature importance to identify majority class data at the classification boundary with minimal contribution and subsequently executes undersampling. The flexibility to adjust sampling ratios and the integration of OCF and UCF enable the implementation of hybrid sampling. Finally, experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. Furthermore, it is observed that the method proposed in this study enhances the feature dividing capability of decision tree classifiers. Hence, the best results are achieved when working in synergy with decision tree classifiers, leading to the most significant improvements in classification performance. All codes have been published at <span><span>https://github.com/Wangfeiopen/HS</span><svg><path></path></svg></span><span><span>CF</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"692 ","pages":"Article 121570"},"PeriodicalIF":8.1000,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Resampling approach for imbalanced data classification based on class instance density per feature value intervals\",\"authors\":\"Fei Wang , Ming Zheng , Kai Ma , Xiaowen Hu\",\"doi\":\"10.1016/j.ins.2024.121570\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. However, most conventional resampling approaches fall short in adequately addressing the varying contributions of individual features to the classification model. In response to this defect, this study introduces three novel resampling approaches. The first approach, Oversampling based on class instance density per feature value intervals (OCF), focuses on augmenting the dataset. The second approach, Undersampling based on class instance density per feature value intervals (UCF), seeks to reduce dataset size. The third approach, Hybrid sampling based on class instance density per feature value intervals (HSCF), which can perform oversampling and undersampling simultaneously. These approaches categorize feature value into different intervals based on their varying information content, calculate class instance densities within these intervals, and generate feature values in intervals with high discriminative information. Subsequently, these generated features are combined to synthesize minority class data, effectively achieving oversampling. Additionally, the study combines class instance density and feature importance to identify majority class data at the classification boundary with minimal contribution and subsequently executes undersampling. The flexibility to adjust sampling ratios and the integration of OCF and UCF enable the implementation of hybrid sampling. Finally, experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. Furthermore, it is observed that the method proposed in this study enhances the feature dividing capability of decision tree classifiers. Hence, the best results are achieved when working in synergy with decision tree classifiers, leading to the most significant improvements in classification performance. All codes have been published at <span><span>https://github.com/Wangfeiopen/HS</span><svg><path></path></svg></span><span><span>CF</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":51063,\"journal\":{\"name\":\"Information Sciences\",\"volume\":\"692 \",\"pages\":\"Article 121570\"},\"PeriodicalIF\":8.1000,\"publicationDate\":\"2024-10-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Sciences\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0020025524014841\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"0\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Sciences","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0020025524014841","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"0","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Resampling approach for imbalanced data classification based on class instance density per feature value intervals
In practical applications, imbalanced datasets significantly degrade the classification performance of machine learning models. However, most conventional resampling approaches fall short in adequately addressing the varying contributions of individual features to the classification model. In response to this defect, this study introduces three novel resampling approaches. The first approach, Oversampling based on class instance density per feature value intervals (OCF), focuses on augmenting the dataset. The second approach, Undersampling based on class instance density per feature value intervals (UCF), seeks to reduce dataset size. The third approach, Hybrid sampling based on class instance density per feature value intervals (HSCF), which can perform oversampling and undersampling simultaneously. These approaches categorize feature value into different intervals based on their varying information content, calculate class instance densities within these intervals, and generate feature values in intervals with high discriminative information. Subsequently, these generated features are combined to synthesize minority class data, effectively achieving oversampling. Additionally, the study combines class instance density and feature importance to identify majority class data at the classification boundary with minimal contribution and subsequently executes undersampling. The flexibility to adjust sampling ratios and the integration of OCF and UCF enable the implementation of hybrid sampling. Finally, experiments on the benchmark dataset demonstrate the superiority and effectiveness of the proposed method. Furthermore, it is observed that the method proposed in this study enhances the feature dividing capability of decision tree classifiers. Hence, the best results are achieved when working in synergy with decision tree classifiers, leading to the most significant improvements in classification performance. All codes have been published at https://github.com/Wangfeiopen/HSCF.
期刊介绍:
Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and surveying contributions.
Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.