Decision tree rule-based feature selection for large-scale imbalanced data
Haoyue Liu, Mengchu Zhou
2017 26th Wireless and Optical Communication Conference (WOCC), pp. 1-6, published 2017-04-07
DOI: https://doi.org/10.1109/WOCC.2017.7928973
Citations: 6
Abstract
Class imbalance arises in many real-world applications, e.g., fault diagnosis, text categorization, and fraud detection. When a dataset is both large-scale and imbalanced, feature selection becomes a major challenge. To address it, this work proposes a feature selection approach based on a decision tree rule. The effectiveness of the proposed approach is verified by classifying a large-scale dataset from Santander Bank. The results show that our approach achieves a higher Area Under the Curve (AUC) and requires less computational time. We also compare it with filter-based feature selection approaches, i.e., Chi-Square and F-statistic. The results show that it outperforms them but needs slightly more computational effort.
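The abstract does not spell out the decision tree rule itself, so the following is only an illustrative sketch and not the authors' method. It assumes a feature can be scored by the largest Gini impurity reduction that a single-threshold split (a depth-one tree) achieves on it, and it computes AUC as the fraction of positive/negative pairs ranked correctly. The function names (`gini`, `best_split_gain`, `rank_features`, `auc`) and the toy data are hypothetical; the data are made up for illustration and are unrelated to the Santander Bank dataset.

```python
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """Largest impurity reduction from one threshold split on one feature."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    # Candidate rules have the form "value <= t"; the largest value is
    # skipped because it would leave the right branch empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def rank_features(X: Sequence[Sequence[float]],
                  y: Sequence[int]) -> List[Tuple[int, float]]:
    """Return (feature index, gain) pairs sorted from most to least useful."""
    columns = list(zip(*X))
    gains = [(j, best_split_gain(col, y)) for j, col in enumerate(columns)]
    return sorted(gains, key=lambda g: g[1], reverse=True)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of (positive, negative) pairs where the positive
    scores higher; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced data (2 positives out of 10): feature 0 separates the
# classes, feature 1 is mostly noise. Values are invented for illustration.
X = [
    [0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0], [0.5, 3.0],
    [0.6, 5.0], [0.7, 1.0], [0.8, 2.0], [0.9, 4.0], [1.0, 3.0],
]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

ranking = rank_features(X, y)
for index, gain in ranking:
    column = [row[index] for row in X]
    print(f"feature {index}: gain={gain:.3f}, AUC={auc(column, y):.3f}")
```

Keeping only the top-ranked features from `ranking` would be the selection step in this sketch. AUC is used here rather than accuracy because, with only two positives in ten samples, a classifier that always predicts the majority class would already reach 80% accuracy.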