GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks

A. Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan L. Perkins, H. Subramoni, D. Panda
{"title":"GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks","authors":"A. Awan, Khaled Hamidouche, Akshay Venkatesh, Jonathan L. Perkins, H. Subramoni, D. Panda","doi":"10.1145/2802658.2802672","DOIUrl":null,"url":null,"abstract":"As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to Co-processors, heterogeneous compute resources are the way to move forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation. It has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present an important benchmark that allows the users of different MPI libraries to evaluate performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-Aware benchmark and discuss the challenges associated with identifying and implementing performance parameters like overlap, latency, effect of MPI_Test() calls to progress communication, effect of independent GPU communication while the overlapped computation proceeds under the communication, and the effect of complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-Aware Non-Blocking Collectives in MVAPICH2 and OpenMPI.","PeriodicalId":365272,"journal":{"name":"Proceedings of the 22nd European MPI Users' Group Meeting","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd European MPI Users' Group Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2802658.2802672","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to Co-processors, heterogeneous compute resources are the way to move forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation. It has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present an important benchmark that allows the users of different MPI libraries to evaluate performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-Aware benchmark and discuss the challenges associated with identifying and implementing performance parameters like overlap, latency, effect of MPI_Test() calls to progress communication, effect of independent GPU communication while the overlapped computation proceeds under the communication, and the effect of complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-Aware Non-Blocking Collectives in MVAPICH2 and OpenMPI.
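To make the measurement methodology described above concrete, the following is a minimal sketch (not the authors' actual benchmark code) of how latency and overlap for a GPU-aware non-blocking collective might be timed: a CUDA-aware MPI_Iallreduce is issued on device buffers, dummy computation is overlapped with it, and MPI_Test() is called periodically to progress the communication. It assumes a CUDA-aware MPI library such as MVAPICH2 or Open MPI built with CUDA support; the message size, iteration count, dummy computation, and overlap formula are illustrative choices, not values taken from the paper.

/*
 * Sketch: timing a GPU-aware MPI_Iallreduce on CUDA device buffers,
 * with MPI_Test() calls to progress the collective while dummy host
 * computation is overlapped. Illustrative only; not the paper's benchmark.
 *
 * Build (paths are assumptions): mpicc nbc_sketch.c -lcudart -o nbc_sketch
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define COUNT (1 << 20)   /* floats per rank (illustrative size) */
#define ITERS 100

/* Dummy overlapped computation: busy-wait for target_sec seconds while
 * periodically calling MPI_Test() so the library can progress the
 * outstanding non-blocking collective. */
static void compute_with_test(MPI_Request *req, double target_sec)
{
    int flag = 0;
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < target_sec) {
        /* a real benchmark would run a kernel or DGEMM here */
        MPI_Test(req, &flag, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank;
    float *d_send, *d_recv;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_send, COUNT * sizeof(float));
    cudaMalloc((void **)&d_recv, COUNT * sizeof(float));
    cudaMemset(d_send, 0, COUNT * sizeof(float));

    /* 1. Pure communication time: issue the collective and wait at once. */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        MPI_Iallreduce(d_send, d_recv, COUNT, MPI_FLOAT, MPI_SUM,
                       MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    double t_pure = (MPI_Wtime() - t0) / ITERS;

    /* 2. Overlapped run: issue the collective, compute for roughly the
     *    pure-communication time, then wait for completion. */
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        MPI_Iallreduce(d_send, d_recv, COUNT, MPI_FLOAT, MPI_SUM,
                       MPI_COMM_WORLD, &req);
        compute_with_test(&req, t_pure);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    double t_overlap = (MPI_Wtime() - t0) / ITERS;

    if (rank == 0) {
        /* One common overlap estimate: since the compute phase is sized
         * to ~t_pure, full overlap gives t_overlap ~ t_pure (100 %) and
         * no overlap gives t_overlap ~ 2 * t_pure (0 %). */
        double overlap = 100.0 * (1.0 - (t_overlap - t_pure) / t_pure);
        if (overlap < 0.0)   overlap = 0.0;
        if (overlap > 100.0) overlap = 100.0;
        printf("latency %.2f us, overlap %.1f %%\n", t_pure * 1e6, overlap);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}

The MPI_Test() loop matters because many MPI implementations progress non-blocking collectives only from within MPI calls; how strongly overlap depends on such progress calls is one of the effects the abstract highlights.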