不平衡分类问题的分布式方法

2012 11th International Symposium on Parallel and Distributed Computing Pub Date : 2012-06-25 DOI:10.1109/ISPDC.2012.30

C. Lemnaru, M. Cuibus, Adrian Bona, Andy S. Alic, R. Potolea

{"title":"不平衡分类问题的分布式方法","authors":"C. Lemnaru, M. Cuibus, Adrian Bona, Andy S. Alic, R. Potolea","doi":"10.1109/ISPDC.2012.30","DOIUrl":null,"url":null,"abstract":"Current important challenges in data mining research are triggered by the need to address various particularities of real-world problems, such as imbalanced data and error cost distributions. This paper presents Distributed Evolutionary Cost-Sensitive Balancing, a distributed methodology for dealing with imbalanced data and -- if necessary -- cost distributions. The method employs a genetic algorithm to search for an optimal cost matrix and base classifier settings, which are then employed by a cost-sensitive classifier, wrapped around the base classifier. Individual fitness computation is the most intensive task in the algorithm, but it also presents a high parallelization potential. Two different parallelization alternatives have been explored: a computation-driven approach, and a data-driven approach. Both have been developed within the Apache Watchmaker framework and deployed on Hadoop-based infrastructures. Experimental evaluations performed up to this point have indicated that the computation-driven approach achieves a good classification performance, but does not reduce the running time significantly, the data-driven approach reduces the running time for slow algorithms, such as the kNN and the SVM, while still yielding important performance improvements.","PeriodicalId":287900,"journal":{"name":"2012 11th International Symposium on Parallel and Distributed Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"A Distributed Methodology for Imbalanced Classification Problems\",\"authors\":\"C. Lemnaru, M. Cuibus, Adrian Bona, Andy S. Alic, R. Potolea\",\"doi\":\"10.1109/ISPDC.2012.30\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Current important challenges in data mining research are triggered by the need to address various particularities of real-world problems, such as imbalanced data and error cost distributions. This paper presents Distributed Evolutionary Cost-Sensitive Balancing, a distributed methodology for dealing with imbalanced data and -- if necessary -- cost distributions. The method employs a genetic algorithm to search for an optimal cost matrix and base classifier settings, which are then employed by a cost-sensitive classifier, wrapped around the base classifier. Individual fitness computation is the most intensive task in the algorithm, but it also presents a high parallelization potential. Two different parallelization alternatives have been explored: a computation-driven approach, and a data-driven approach. Both have been developed within the Apache Watchmaker framework and deployed on Hadoop-based infrastructures. Experimental evaluations performed up to this point have indicated that the computation-driven approach achieves a good classification performance, but does not reduce the running time significantly, the data-driven approach reduces the running time for slow algorithms, such as the kNN and the SVM, while still yielding important performance improvements.\",\"PeriodicalId\":287900,\"journal\":{\"name\":\"2012 11th International Symposium on Parallel and Distributed Computing\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 11th International Symposium on Parallel and Distributed Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISPDC.2012.30\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 11th International Symposium on Parallel and Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPDC.2012.30","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

摘要

当前数据挖掘研究的重要挑战是由于需要解决现实世界问题的各种特殊性，如数据不平衡和错误成本分布。本文提出了分布式进化成本敏感平衡，这是一种处理不平衡数据和(如果必要的话)成本分布的分布式方法。该方法采用遗传算法搜索最优代价矩阵和基分类器设置，然后将其包裹在基分类器周围，由代价敏感分类器使用。个体适应度计算是算法中最繁重的任务，但它也具有很高的并行化潜力。已经探索了两种不同的并行化替代方案:计算驱动的方法和数据驱动的方法。两者都是在Apache Watchmaker框架内开发的，并部署在基于hadoop的基础设施上。到目前为止进行的实验评估表明，计算驱动的方法获得了良好的分类性能，但并没有显着减少运行时间，数据驱动的方法减少了慢算法(如kNN和SVM)的运行时间，同时仍然产生了重要的性能改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Distributed Methodology for Imbalanced Classification Problems

Current important challenges in data mining research are triggered by the need to address various particularities of real-world problems, such as imbalanced data and error cost distributions. This paper presents Distributed Evolutionary Cost-Sensitive Balancing, a distributed methodology for dealing with imbalanced data and -- if necessary -- cost distributions. The method employs a genetic algorithm to search for an optimal cost matrix and base classifier settings, which are then employed by a cost-sensitive classifier, wrapped around the base classifier. Individual fitness computation is the most intensive task in the algorithm, but it also presents a high parallelization potential. Two different parallelization alternatives have been explored: a computation-driven approach, and a data-driven approach. Both have been developed within the Apache Watchmaker framework and deployed on Hadoop-based infrastructures. Experimental evaluations performed up to this point have indicated that the computation-driven approach achieves a good classification performance, but does not reduce the running time significantly, the data-driven approach reduces the running time for slow algorithms, such as the kNN and the SVM, while still yielding important performance improvements.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 11th International Symposium on Parallel and Distributed Computing

自引率

0.00%

发文量