Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR

S. M. Ghazimirsaeed, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda
{"title":"Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR","authors":"S. M. Ghazimirsaeed, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/MLHPCAI4S51975.2020.00010","DOIUrl":null,"url":null,"abstract":"The growth of big data applications during the last decade has led to a surge in the deployment and popularity of machine learning (ML) libraries. On the other hand, the high performance offered by GPUs makes them well suited for ML problems. To take advantage of GPU performance for ML, NVIDIA has recently developed the cuML library. cuML is the GPU counterpart of Scikit-learn, and provides similar Pythonic interfaces to Scikit-learn while hiding the complexities of writing GPU compute kernels directly using CUDA. To support execution of ML workloads on Multi-Node Multi- GPU (MNMG) systems, the cuML library exploits the NVIDIA Collective Communications Library (NCCL) as a backend for collective communications between processes. On the other hand, MPI is a de facto standard for communication in HPC systems. Among various MPI libraries, MVAPICH2-GDR is the pioneer in optimizing GPU communication.This paper explores various aspects and challenges of providing MPI-based communication support for GPU-accelerated cuML applications. More specifically, it proposes a Python API to take advantage of MPI-based communications for cuML applications. It also gives an in-depth analysis, characterization, and benchmarking of the cuML algorithms such as K-Means, Nearest Neighbors, Random Forest, and tSVD. Moreover, it provides a comprehensive performance evaluation and profiling study for MPI-based versus NCCL-based communication for these algorithms. The evaluation results show that the proposed MPI-based communication approach achieves up to 1.6x, 1.25x, 1.25x, and 1.36x speedup for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively on up to 32 GPUs.","PeriodicalId":47667,"journal":{"name":"Foundations and Trends in Machine Learning","volume":"31 1","pages":"1-12"},"PeriodicalIF":65.3000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Foundations and Trends in Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MLHPCAI4S51975.2020.00010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

Abstract

The growth of big data applications during the last decade has led to a surge in the deployment and popularity of machine learning (ML) libraries. At the same time, the high performance offered by GPUs makes them well suited for ML problems. To take advantage of GPU performance for ML, NVIDIA has recently developed the cuML library. cuML is the GPU counterpart of Scikit-learn and provides similar Pythonic interfaces while hiding the complexities of writing GPU compute kernels directly in CUDA. To support execution of ML workloads on Multi-Node Multi-GPU (MNMG) systems, the cuML library uses the NVIDIA Collective Communications Library (NCCL) as a backend for collective communication between processes. Meanwhile, MPI is the de facto standard for communication in HPC systems. Among the various MPI libraries, MVAPICH2-GDR is the pioneer in optimizing GPU communication. This paper explores various aspects and challenges of providing MPI-based communication support for GPU-accelerated cuML applications. More specifically, it proposes a Python API to take advantage of MPI-based communication for cuML applications. It also gives an in-depth analysis, characterization, and benchmarking of cuML algorithms such as K-Means, Nearest Neighbors, Random Forest, and tSVD. Moreover, it provides a comprehensive performance evaluation and profiling study of MPI-based versus NCCL-based communication for these algorithms. The evaluation results show that the proposed MPI-based communication approach achieves up to 1.6x, 1.25x, 1.25x, and 1.36x speedup for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively, on up to 32 GPUs.
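
The abstract describes replacing cuML's NCCL backend with MPI-based collectives driven from Python. As a rough illustration of the underlying mechanism only (not the paper's proposed API), the sketch below shows a CUDA-aware MPI Allreduce on GPU-resident data from Python using mpi4py and CuPy; it assumes mpi4py (3.1 or newer) built against a CUDA-aware MPI library such as MVAPICH2-GDR, and all variable names are illustrative.

```python
# Minimal sketch: CUDA-aware MPI Allreduce on GPU buffers from Python.
# Assumes mpi4py >= 3.1 compiled against a CUDA-aware MPI (e.g. MVAPICH2-GDR)
# and CuPy for GPU arrays. This is not the paper's actual cuML API.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Bind each rank to a GPU (round-robin over the visible devices).
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# Each rank holds a partial result on its own GPU, e.g. local centroid
# sums in one iteration of a distributed K-Means.
local = cp.full(8, float(rank), dtype=cp.float32)
result = cp.empty_like(local)

# mpi4py recognizes objects exposing __cuda_array_interface__, so a
# CUDA-aware MPI can move these buffers GPU-to-GPU without host staging.
comm.Allreduce(local, result, op=MPI.SUM)

if rank == 0:
    print("reduced:", result)  # sum over ranks, replicated on every GPU
```

Such a script would typically be launched with `mpirun -np 4 python allreduce_gpu.py`; with MVAPICH2-GDR, CUDA support is enabled at runtime via `MV2_USE_CUDA=1` so that the library can use GPU-direct communication paths.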