Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR

IF 65.3 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
S. M. Ghazimirsaeed, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda
DOI: 10.1109/MLHPCAI4S51975.2020.00010 (https://doi.org/10.1109/MLHPCAI4S51975.2020.00010)
Published: 2020-11-01
Citations: 0

Abstract

Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR
The growth of big data applications during the last decade has led to a surge in the deployment and popularity of machine learning (ML) libraries. On the other hand, the high performance offered by GPUs makes them well suited for ML problems. To take advantage of GPU performance for ML, NVIDIA has recently developed the cuML library. cuML is the GPU counterpart of Scikit-learn, and provides similar Pythonic interfaces to Scikit-learn while hiding the complexities of writing GPU compute kernels directly using CUDA. To support execution of ML workloads on Multi-Node Multi-GPU (MNMG) systems, the cuML library exploits the NVIDIA Collective Communications Library (NCCL) as a backend for collective communications between processes. On the other hand, MPI is a de facto standard for communication in HPC systems. Among various MPI libraries, MVAPICH2-GDR is the pioneer in optimizing GPU communication. This paper explores various aspects and challenges of providing MPI-based communication support for GPU-accelerated cuML applications. More specifically, it proposes a Python API to take advantage of MPI-based communications for cuML applications. It also gives an in-depth analysis, characterization, and benchmarking of the cuML algorithms such as K-Means, Nearest Neighbors, Random Forest, and tSVD. Moreover, it provides a comprehensive performance evaluation and profiling study for MPI-based versus NCCL-based communication for these algorithms. The evaluation results show that the proposed MPI-based communication approach achieves up to 1.6x, 1.25x, 1.25x, and 1.36x speedup for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively, on up to 32 GPUs.
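To make the role of collective communication concrete, the sketch below shows why an algorithm such as K-Means needs it in a multi-worker setting: each worker computes per-cluster partial sums over its own data shard, and a sum-reduction across all workers yields the new centroids. This is only an illustration under stated assumptions. It is not the Python API proposed in the paper, and it does not reproduce how cuML, NCCL, or MVAPICH2-GDR implement the step. The workers are simulated in a single process with plain Python lists, and every function name (`local_partials`, `allreduce_sum`, `kmeans_step`) is hypothetical. In a real MPI program the reduction would typically be a call such as `MPI_Allreduce`.

```python
# Illustrative only: simulates the allreduce pattern inside one process.
# No MPI, NCCL, cuML, or GPU is involved; all names here are hypothetical.

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def local_partials(shard, centroids):
    """Per-worker step: per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in shard:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def allreduce_sum(partials):
    """Stand-in for a sum-allreduce: element-wise total over all workers."""
    n_clusters = len(partials[0][1])
    dim = len(partials[0][0][0])
    total_sums = [[0.0] * dim for _ in range(n_clusters)]
    total_counts = [0] * n_clusters
    for sums, counts in partials:
        for k in range(n_clusters):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    return total_sums, total_counts

def kmeans_step(shards, centroids):
    """One K-Means update: local partials, global reduction, new centroids."""
    partials = [local_partials(shard, centroids) for shard in shards]
    sums, counts = allreduce_sum(partials)
    # An empty cluster keeps its previous centroid.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]

# Two simulated workers, each holding one shard of 2-D points.
shards = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.0, 2.0), (10.0, 12.0)],
]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(shards, centroids)
print(new_centroids)
```

Only the small per-cluster sums and counts cross worker boundaries, never the data points themselves, so the cost of each iteration's communication depends on how efficiently the backend performs this reduction.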
Source journal
Foundations and Trends in Machine Learning (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
CiteScore: 108.50
Self-citation rate: 0.00%
Articles published: 5
About the journal: Each issue of Foundations and Trends® in Machine Learning comprises a monograph of at least 50 pages written by research leaders in the field. We aim to publish monographs that provide an in-depth, self-contained treatment of topics where there have been significant new developments. Typically, this means that the monographs we publish will contain a significant level of mathematical detail (to describe the central methods and/or theory for the topic at hand), and will not eschew these details by simply pointing to existing references. Literature surveys and original research papers do not fall within these aims.