Accelerating a Random Forest Classifier: Multi-Core, GP-GPU, or FPGA?

B. V. Essen, C. Macaraeg, M. Gokhale, R. Prenger
{"title":"加速随机森林分类器:多核、GP-GPU还是FPGA?","authors":"B. V. Essen, C. Macaraeg, M. Gokhale, R. Prenger","doi":"10.1109/FCCM.2012.47","DOIUrl":null,"url":null,"abstract":"Random forest classification is a well known machine learning technique that generates classifiers in the form of an ensemble (\"forest\") of decision trees. The classification of an input sample is determined by the majority classification by the ensemble. Traditional random forest classifiers can be highly effective, but classification using a random forest is memory bound and not typically suitable for acceleration using FPGAs or GP-GPUs due to the need to traverse large, possibly irregular decision trees. Recent work at Lawrence Livermore National Laboratory has developed several variants of random forest classifiers, including the Compact Random Forest (CRF), that can generate decision trees more suitable for acceleration than traditional decision trees. Our paper compares and contrasts the effectiveness of FPGAs, GP-GPUs, and multi-core CPUs for accelerating classification using models generated by compact random forest machine learning classifiers. Taking advantage of training algorithms that can produce compact random forests composed of many, small trees rather than fewer, deep trees, we are able to regularize the forest such that the classification of any sample takes a deterministic amount of time. This optimization then allows us to execute the classifier in a pipelined or single-instruction multiple thread (SIMT) fashion. We show that FPGAs provide the highest performance solution, but require a multi-chip / multi-board system to execute even modest sized forests. GP-GPUs offer a more flexible solution with reasonably high performance that scales with forest size. Finally, multi-threading via Open MP on a shared memory system was the simplest solution and provided near linear performance that scaled with core count, but was still significantly slower than the GP-GPU and FPGA.","PeriodicalId":226197,"journal":{"name":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"148","resultStr":"{\"title\":\"Accelerating a Random Forest Classifier: Multi-Core, GP-GPU, or FPGA?\",\"authors\":\"B. V. Essen, C. Macaraeg, M. Gokhale, R. Prenger\",\"doi\":\"10.1109/FCCM.2012.47\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Random forest classification is a well known machine learning technique that generates classifiers in the form of an ensemble (\\\"forest\\\") of decision trees. The classification of an input sample is determined by the majority classification by the ensemble. Traditional random forest classifiers can be highly effective, but classification using a random forest is memory bound and not typically suitable for acceleration using FPGAs or GP-GPUs due to the need to traverse large, possibly irregular decision trees. Recent work at Lawrence Livermore National Laboratory has developed several variants of random forest classifiers, including the Compact Random Forest (CRF), that can generate decision trees more suitable for acceleration than traditional decision trees. Our paper compares and contrasts the effectiveness of FPGAs, GP-GPUs, and multi-core CPUs for accelerating classification using models generated by compact random forest machine learning classifiers. 
Taking advantage of training algorithms that can produce compact random forests composed of many, small trees rather than fewer, deep trees, we are able to regularize the forest such that the classification of any sample takes a deterministic amount of time. This optimization then allows us to execute the classifier in a pipelined or single-instruction multiple thread (SIMT) fashion. We show that FPGAs provide the highest performance solution, but require a multi-chip / multi-board system to execute even modest sized forests. GP-GPUs offer a more flexible solution with reasonably high performance that scales with forest size. Finally, multi-threading via Open MP on a shared memory system was the simplest solution and provided near linear performance that scaled with core count, but was still significantly slower than the GP-GPU and FPGA.\",\"PeriodicalId\":226197,\"journal\":{\"name\":\"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-04-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"148\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FCCM.2012.47\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2012.47","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 148

Abstract

Random forest classification is a well-known machine learning technique that generates classifiers in the form of an ensemble ("forest") of decision trees. The classification of an input sample is determined by the majority classification of the ensemble. Traditional random forest classifiers can be highly effective, but classification using a random forest is memory bound and not typically suitable for acceleration using FPGAs or GP-GPUs due to the need to traverse large, possibly irregular decision trees. Recent work at Lawrence Livermore National Laboratory has developed several variants of random forest classifiers, including the Compact Random Forest (CRF), that can generate decision trees more suitable for acceleration than traditional decision trees. Our paper compares and contrasts the effectiveness of FPGAs, GP-GPUs, and multi-core CPUs for accelerating classification using models generated by compact random forest machine learning classifiers. Taking advantage of training algorithms that can produce compact random forests composed of many small trees rather than fewer deep trees, we are able to regularize the forest such that the classification of any sample takes a deterministic amount of time. This optimization then allows us to execute the classifier in a pipelined or single-instruction multiple-thread (SIMT) fashion. We show that FPGAs provide the highest-performance solution, but require a multi-chip/multi-board system to execute even modest-sized forests. GP-GPUs offer a more flexible solution with reasonably high performance that scales with forest size. Finally, multi-threading via OpenMP on a shared-memory system was the simplest solution and provided near-linear performance that scaled with core count, but was still significantly slower than the GP-GPU and FPGA.
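The key optimization described in the abstract is that every tree in a compact random forest has the same fixed depth, so classifying a sample costs a fixed number of comparisons and the forest can be stored in flat arrays and evaluated in a pipelined or SIMT manner. The sketch below is a minimal illustration of that idea for the shared-memory (OpenMP) case only; the depth, tree count, node layout, and all identifiers (Tree, classify_tree, classify_forest) are illustrative assumptions and not the paper's implementation.

```c
/* Minimal sketch of compact-random-forest classification with fixed-depth
 * trees, assuming binary classification and a flattened (array-based) node
 * layout. All constants and names are illustrative. */

#define DEPTH      6                      /* every tree has exactly this depth */
#define NODES      ((1 << DEPTH) - 1)     /* internal nodes of a complete tree */
#define LEAVES     (1 << DEPTH)           /* leaves of a complete tree         */
#define NUM_FEATS  32                     /* features per sample (assumed)     */

typedef struct {
    int   feature[NODES];    /* feature index tested at each internal node */
    float threshold[NODES];  /* split threshold at each internal node      */
    int   label[LEAVES];     /* class stored at each leaf (0 or 1)         */
} Tree;

/* Traverse one complete tree: exactly DEPTH comparisons per sample, so the
 * work per classification is deterministic regardless of the input. */
static int classify_tree(const Tree *t, const float *x)
{
    int node = 0;
    for (int d = 0; d < DEPTH; ++d) {
        int right = x[t->feature[node]] > t->threshold[node];
        node = 2 * node + 1 + right;      /* left child = 2n+1, right = 2n+2 */
    }
    return t->label[node - NODES];        /* index into the leaf array */
}

/* Majority vote over the forest. Samples are independent, so they can be
 * classified in parallel with OpenMP, mirroring the shared-memory baseline. */
void classify_forest(const Tree *forest, int num_trees,
                     const float *samples, int num_samples, int *out)
{
    #pragma omp parallel for schedule(static)
    for (int s = 0; s < num_samples; ++s) {
        const float *x = &samples[s * NUM_FEATS];
        int votes = 0;
        for (int t = 0; t < num_trees; ++t)
            votes += classify_tree(&forest[t], x);
        out[s] = (2 * votes > num_trees); /* majority of binary votes */
    }
}
```

Because every traversal performs exactly DEPTH comparisons, the same loop maps naturally onto a fixed-length FPGA pipeline or onto GPU threads that never diverge in trip count, which is the property the abstract exploits.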