Entropy-based hybrid sampling (EHS) method to handle class overlap in highly imbalanced dataset
Authors: Anil Kumar, Dinesh Singh, Rama Shankar Yadav
Journal: Expert Systems, 41 (11), published 2024-07-30 (impact factor 3.0; JCR Q2, Computer Science, Artificial Intelligence)
DOI: 10.1111/exsy.13679
URL: https://onlinelibrary.wiley.com/doi/10.1111/exsy.13679
Abstract
Class imbalance and class overlap make it difficult to train standard machine learning algorithms. Their performance on minority classes is poor, especially when class imbalance is high and class overlap is significant. Researchers have recently observed that the joint effect of class overlap and imbalance is more harmful than their individual effects. Many methods have been proposed in recent years to handle these problems; they can be broadly categorized as data-level, algorithm-level, ensemble learning, and hybrid methods. Existing data-level methods often suffer from information loss and overfitting. To overcome these problems, we introduce a novel entropy-based hybrid sampling (EHS) method to handle class overlap in highly imbalanced datasets. EHS eliminates less informative majority instances from the overlap region during the undersampling phase and generates highly informative synthetic minority instances near the borderline during the oversampling phase. With DT, NB, and SVM classifiers, the proposed EHS achieved significant improvements in F1-score, G-mean, and AUC over well-established state-of-the-art methods. Classifier performance was tested on 28 datasets covering extreme ranges of imbalance and overlap.
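The abstract does not spell out the EHS algorithm, so the following is only an illustrative sketch of the general idea of entropy-guided hybrid sampling, not the authors' method. It assumes binary labels (0 = majority, 1 = minority), uses the Shannon entropy of the labels among the k nearest neighbours as a stand-in overlap indicator, and swaps in a simplified SMOTE-style interpolation for the oversampling phase. The function names, the parameters `k` and `drop_threshold`, and both selection rules are hypothetical.

```python
import math
import random


def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i] (Euclidean), excluding i."""
    dists = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in dists[:k]]


def neighbourhood_entropy(X, y, i, k=5):
    """Binary Shannon entropy (in bits) of the labels among the k neighbours of X[i]."""
    neigh = knn_indices(X, i, k)
    p = sum(y[j] for j in neigh) / len(neigh)  # fraction of minority neighbours
    if p in (0.0, 1.0):
        return 0.0  # pure neighbourhood: no overlap signal
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def hybrid_resample(X, y, k=5, drop_threshold=0.9, seed=0):
    """Entropy-guided undersampling followed by borderline oversampling (sketch)."""
    rng = random.Random(seed)
    H = [neighbourhood_entropy(X, y, i, k) for i in range(len(X))]

    # Undersampling (assumed rule): drop majority points whose neighbourhood
    # is highly mixed, i.e. points sitting inside the overlap region.
    keep = [i for i in range(len(X)) if not (y[i] == 0 and H[i] >= drop_threshold)]
    X_new = [list(X[i]) for i in keep]
    y_new = [y[i] for i in keep]

    # Oversampling (assumed rule): interpolate from a borderline minority point
    # (entropy > 0) toward its nearest minority neighbour until classes balance.
    minority = [i for i in range(len(X)) if y[i] == 1]
    borderline = [i for i in minority if H[i] > 0.0]
    deficit = y_new.count(0) - y_new.count(1)
    while borderline and len(minority) > 1 and deficit > 0:
        i = rng.choice(borderline)
        j = min((m for m in minority if m != i), key=lambda m: math.dist(X[i], X[m]))
        t = rng.random()
        X_new.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
        y_new.append(1)
        deficit -= 1
    return X_new, y_new


# Toy data: four majority points and two minority points.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11)]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = hybrid_resample(X, y, k=3)
```

On this toy data no majority point has a mixed neighbourhood, so nothing is removed, and the two minority points are interpolated until both classes have four instances. A real overlap measure, neighbourhood size, and removal criterion would have to come from the paper itself.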
About the journal:
Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper.
As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.