Accelerating Random Forest Classification on GPU and FPGA

Proceedings of the 51st International Conference on Parallel Processing. Pub Date: 2022-08-29. DOI: 10.1145/3545008.3545067

Milan Shah, Reece Neff, Hancheng Wu, Marco Minutoli, Antonino Tumeo, M. Becchi


Citation count: 0

Abstract

Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks across a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving the performance of RF training, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. To support large datasets efficiently, we propose a hierarchical memory layout suited to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers four aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that our code variants outperform the CSR baseline on both GPU and FPGA, with the best performance reported on GPU. For high accuracy targets, our GPU implementation yields a 5-9× speedup over CSR and up to a 2× speedup over Nvidia's cuML library.
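For readers unfamiliar with the CSR baseline mentioned in the abstract, the following is a minimal Python sketch of how a decision tree might be stored in a CSR-style layout and traversed, with a majority vote across the trees of a forest. The abstract does not describe the authors' exact encoding, their hierarchical layout, or their kernels, so the `CsrTree` fields, the `classify_tree` and `classify_forest` helpers, and the toy forest are hypothetical illustrations, not the paper's implementation.

```python
from collections import Counter
from typing import List, NamedTuple, Sequence


class CsrTree(NamedTuple):
    """Hypothetical CSR-style encoding of one binary decision tree.

    The children of node i are children[row_ptr[i]:row_ptr[i + 1]].
    A leaf has an empty child range (row_ptr[i] == row_ptr[i + 1]).
    """
    row_ptr: List[int]      # len = num_nodes + 1
    children: List[int]     # concatenated child indices (left, right)
    feature: List[int]      # feature tested at each internal node (-1 at leaves)
    threshold: List[float]  # split threshold at each internal node
    label: List[int]        # class label at each leaf (-1 at internal nodes)


def classify_tree(tree: CsrTree, sample: Sequence[float]) -> int:
    """Walk from the root (node 0) to a leaf and return the leaf's label."""
    node = 0
    while tree.row_ptr[node] != tree.row_ptr[node + 1]:
        start = tree.row_ptr[node]
        # Assumed convention: value <= threshold goes left, otherwise right.
        go_right = sample[tree.feature[node]] > tree.threshold[node]
        node = tree.children[start + (1 if go_right else 0)]
    return tree.label[node]


def classify_forest(forest: Sequence[CsrTree], sample: Sequence[float]) -> int:
    """Return the majority vote over all trees in the forest."""
    votes = Counter(classify_tree(tree, sample) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy forest (made-up values, for illustration only).
# Tree A: root splits on feature 0; its right child splits on feature 1.
TREE_A = CsrTree(
    row_ptr=[0, 2, 2, 4, 4, 4],
    children=[1, 2, 3, 4],
    feature=[0, -1, 1, -1, -1],
    threshold=[0.5, 0.0, 2.0, 0.0, 0.0],
    label=[-1, 0, -1, 1, 0],
)
# Trees B and C: single-split stumps on feature 1 and feature 0.
TREE_B = CsrTree([0, 2, 2, 2], [1, 2], [1, -1, -1], [1.0, 0.0, 0.0], [-1, 0, 1])
TREE_C = CsrTree([0, 2, 2, 2], [1, 2], [0, -1, -1], [0.0, 0.0, 0.0], [-1, 0, 1])
FOREST = [TREE_A, TREE_B, TREE_C]

prediction = classify_forest(FOREST, [0.9, 1.5])
```

Each step of the traversal loop performs dependent, data-driven lookups into `row_ptr`, `children`, `feature`, and `threshold`, which illustrates why the memory layout of the trees matters for classification speed on accelerators. In this sketch, ties in the vote are broken by the order in which labels are first seen, which is an arbitrary choice.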
