Chris Seiffert, T. Khoshgoftaar, J. V. Hulse, Amri Napolitano
2008 IEEE International Conference on Data Mining Workshops. Published 2008-12-15. DOI: 10.1109/ICDMW.2008.119 (https://doi.org/10.1109/ICDMW.2008.119)
A Comparative Study of Data Sampling and Cost Sensitive Learning
Two common challenges that data mining and machine learning practitioners face in many application domains are unequal classification costs and class imbalance. Most traditional data mining techniques attempt to maximize overall accuracy rather than minimize cost. When data is imbalanced, such techniques produce models that strongly favor the overrepresented class, which typically carries the lower cost of misclassification. Two techniques that have been used to address both of these issues are cost-sensitive learning and data sampling. In this work, we investigate the performance of two cost-sensitive learning techniques and four data sampling techniques for minimizing classification costs when data is imbalanced. We present a comprehensive suite of experiments, utilizing 15 datasets with 10 cost ratios, which have been carefully designed to ensure conclusive, significant and reliable results.
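The contrast between maximizing accuracy and minimizing cost can be illustrated with a minimal sketch. It is not taken from the paper, which does not name its specific techniques in the abstract. The sketch assumes that label 1 is the minority, high-cost class, that correct predictions cost nothing, and that a classifier already outputs probabilities; the probabilities, the 1:10 cost ratio, and all function names are hypothetical. It shows one generic form of cost-sensitive decision making (moving the decision threshold to the minimum-expected-cost point) and one generic form of data sampling (random undersampling of the majority class).

```python
import random


def min_cost_threshold(cost_fp, cost_fn):
    """Probability threshold at which predicting the positive class becomes
    cheaper in expectation, assuming correct predictions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Sum of misclassification costs over a labeled set."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += cost_fn  # missed a minority (positive) example
        elif truth == 0 and pred == 1:
            cost += cost_fp  # false alarm on a majority (negative) example
    return cost


def random_undersample(X, y, seed=0):
    """Randomly drop majority-class (label 0) examples until both classes
    have the same number of examples."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    idx = sorted(minority + kept)
    return [X[i] for i in idx], [y[i] for i in idx]


# Hypothetical predicted probabilities of the positive class and true labels.
probs = [0.05, 0.10, 0.20, 0.30, 0.60, 0.15, 0.40, 0.80]
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
cost_fp, cost_fn = 1.0, 10.0  # assumed cost ratio of 1:10

# Accuracy-oriented decision: the default 0.5 threshold.
accuracy_preds = [int(p >= 0.5) for p in probs]

# Cost-sensitive decision: threshold shifted toward the cheaper error.
threshold = min_cost_threshold(cost_fp, cost_fn)
cost_preds = [int(p >= threshold) for p in probs]

print(total_cost(y_true, accuracy_preds, cost_fp, cost_fn))  # 21.0
print(total_cost(y_true, cost_preds, cost_fp, cost_fn))      # 4.0

# Data sampling alternative: balance the training set before fitting a model.
X_bal, y_bal = random_undersample(probs, y_true)
print(len(y_bal), sum(y_bal))  # 6 3
```

On this toy data the accuracy-oriented threshold makes fewer errors (3 versus 4) yet incurs a higher total cost (21 versus 4), which is the gap the abstract describes. Undersampling approaches the same problem from the data side by changing the class distribution the learner sees rather than the decision rule.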