Dealing with Class Imbalance the Scalable Way: Evaluation of Various Techniques Based on Classification Grade and Computational Complexity
Authors: Bernhard Schlegel, B. Sick
Venue: 2017 IEEE International Conference on Data Mining Workshops (ICDMW)
Published: 2017-11-01
DOI: 10.1109/ICDMW.2017.16 (https://doi.org/10.1109/ICDMW.2017.16)
Highly imbalanced datasets remain a challenge in many data mining applications. Surprisingly, state-of-the-art techniques for countering class imbalance are usually computationally expensive and therefore do not scale. Most research effort has gone into enhancing those techniques, e.g., by focusing on borderline examples or combining multiple techniques. This usually increases computational complexity, reducing scalability even further. This article makes four major contributions. First, existing techniques for dealing with imbalanced datasets are evaluated with respect to their computational cost and their influence on classification performance across a variety of publicly available datasets and classifiers. Second, a new, scalable technique, class-specific scaling (CSS), is proposed as an alternative and compared with the existing techniques. Third, a parameter-free measure of class overlap and noise is introduced to complement existing measures of dataset properties, such as the class balance ratio and the number of features and samples. This enables a finer categorization of imbalanced datasets. Fourth, based on these measures and on basic conditions such as scalability and the classifier used, general recommendations regarding the suitability of the different techniques are derived.
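The abstract names three basic dataset properties: the class balance ratio, the number of features, and the number of samples. As a rough illustration only, the following sketch computes them for a small toy dataset. The function name `dataset_properties`, the toy data, and the convention of defining the balance ratio as majority count divided by minority count are all assumptions made here; the abstract does not state the paper's exact definition, and this sketch covers neither CSS nor the proposed overlap and noise measure.

```python
from collections import Counter


def dataset_properties(X, y):
    """Return basic descriptors of a labeled dataset (hypothetical helper).

    X: sequence of feature rows, each row a sequence of equal length
    y: sequence of class labels, one per row
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    if len(y) == 0:
        raise ValueError("dataset is empty")

    counts = Counter(y)               # samples per class
    majority = max(counts.values())   # size of the largest class
    minority = min(counts.values())   # size of the smallest class

    return {
        "n_samples": len(y),
        "n_features": len(X[0]),
        "class_counts": dict(counts),
        # Assumed convention: majority count / minority count (1.0 = balanced).
        "balance_ratio": majority / minority,
    }


# Toy data (made up for illustration): four samples of class 0, one of class 1.
X = [[0.1, 1.2], [0.3, 0.9], [0.2, 1.1], [0.4, 1.0], [2.5, 3.1]]
y = [0, 0, 0, 0, 1]

props = dataset_properties(X, y)
print(props)
```

Under this assumed convention a larger ratio indicates a stronger imbalance; other works use the inverse (minority over majority), so the convention should be checked against the paper before comparing values.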