C. Lemnaru, M. Cuibus, Adrian Bona, Andy S. Alic, R. Potolea
A Distributed Methodology for Imbalanced Classification Problems
2012 11th International Symposium on Parallel and Distributed Computing, published 2012-06-25
DOI: 10.1109/ISPDC.2012.30 (https://doi.org/10.1109/ISPDC.2012.30)
Citations: 9
Abstract
Important current challenges in data mining research stem from the need to address particularities of real-world problems, such as imbalanced data and error cost distributions. This paper presents Distributed Evolutionary Cost-Sensitive Balancing, a distributed methodology for dealing with imbalanced data and, if necessary, cost distributions. The method employs a genetic algorithm to search for an optimal cost matrix and base classifier settings, which are then used by a cost-sensitive classifier wrapped around the base classifier. Computing the fitness of each individual is the most intensive task in the algorithm, but it also offers high parallelization potential. Two parallelization alternatives have been explored: a computation-driven approach and a data-driven approach. Both have been developed within the Apache Watchmaker framework and deployed on Hadoop-based infrastructures. Experimental evaluations performed so far indicate that the computation-driven approach achieves good classification performance but does not reduce the running time significantly, whereas the data-driven approach reduces the running time for slow algorithms, such as kNN and SVM, while still yielding important performance improvements.
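The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All names (`make_data`, `base_score`, `predict`, `fitness`, `evolve`) are hypothetical, the data is synthetic, the base classifier is a fixed logistic score, and balanced accuracy is an assumed fitness measure that the abstract does not specify. A genetic algorithm searches for one entry of the cost matrix (the false-negative cost), and each individual's fitness is evaluated as a separate task, loosely mirroring the computation-driven alternative. A thread pool stands in for the Hadoop infrastructure.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def make_data(rng, n_neg=200, n_pos=20):
    """Synthetic 1-D imbalanced data (assumption): 10:1 negatives to positives."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_neg)]
    data += [(rng.gauss(1.5, 1.0), 1) for _ in range(n_pos)]
    return data


def base_score(x):
    """Stand-in base classifier: a fixed logistic estimate of P(y=1 | x)."""
    return 1.0 / (1.0 + math.exp(-2.0 * (x - 2.0)))


def predict(p_pos, cost_fn, cost_fp=1.0):
    """Cost-sensitive wrapper: choose the label with the lower expected cost."""
    # Predicting 0 risks p_pos * cost_fn; predicting 1 risks (1 - p_pos) * cost_fp.
    return 1 if p_pos * cost_fn > (1.0 - p_pos) * cost_fp else 0


def fitness(cost_fn, data):
    """Balanced accuracy of the wrapped classifier for one candidate cost."""
    tp = tn = 0
    for x, y in data:
        if predict(base_score(x), cost_fn) == y:
            if y == 1:
                tp += 1
            else:
                tn += 1
    n_pos = sum(y for _, y in data)
    n_neg = len(data) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def evolve(data, rng, pop_size=8, generations=10, workers=4):
    """Tiny elitist GA over the false-negative cost, with parallel fitness calls."""
    # Seed the population with the cost-insensitive baseline (cost_fn = 1.0).
    population = [1.0] + [rng.uniform(1.0, 50.0) for _ in range(pop_size - 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # One task per individual: the fitness evaluations are independent.
            scores = list(pool.map(lambda c: fitness(c, data), population))
            ranked = [c for _, c in sorted(zip(scores, population), reverse=True)]
            elites = ranked[: pop_size // 2]
            # Mutation: Gaussian perturbation, clipped to the search range.
            children = [min(50.0, max(1.0, c + rng.gauss(0.0, 2.0))) for c in elites]
            population = elites + children
        scores = list(pool.map(lambda c: fitness(c, data), population))
    best_score, best_cost = max(zip(scores, population))
    return best_cost, best_score


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_data(rng)
    cost, score = evolve(data, rng)
    print(f"baseline={fitness(1.0, data):.3f} best_cost={cost:.2f} best={score:.3f}")
```

Because the elites are carried over unchanged and the initial population contains the baseline cost of 1.0, the returned fitness can never be lower than that of the cost-insensitive classifier. In CPython, threads do not speed up CPU-bound fitness functions, so the pool here only illustrates the task structure. A data-driven variant would instead partition `data` across workers and combine the partial counts.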