MAGC: A Mapping Approach for GPU Clusters

2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). Pub Date: 2016-10-01. DOI: 10.1109/SBAC-PAD.2016.15

S. Mirsadeghi, Iman Faraji, A. Afsahi


Citations: 2

Abstract

GPU accelerators are increasingly used in modern heterogeneous HPC clusters because they offer high performance and energy efficiency. Such clusters, which combine multiple CPU cores and GPU devices, have become the platform of choice for many HPC applications. The communication channels among these processing elements have different latency and bandwidth characteristics, so using those channels efficiently is an important factor in achieving higher inter-process communication performance. In this paper, we exploit topology awareness to make better use of the communication channels in GPU clusters. We first discuss the challenges of topology-aware mapping in GPU clusters and then propose MAGC, a Mapping Approach for GPU Clusters. MAGC seeks to improve total communication performance by jointly considering the application's CPU-to-CPU and GPU-to-GPU communications together with the CPU and GPU physical topologies of the underlying cluster. It provides a unified framework for topology-aware process-to-core mapping and GPU-to-process assignment across a GPU cluster. We study the potential benefits of MAGC with two mapping algorithms: (a) the Scotch graph mapping library and (b) a heuristic designed to explicitly consider maximum congestion. We evaluate our design through extensive micro-benchmark and application-level experiments on two GPU clusters with different GPU types and topologies. We developed a micro-benchmark suite that models various communication patterns among CPU cores and among GPU devices. For application results, we use the molecular dynamics simulator HOOMD-blue. Micro-benchmark results show up to 91.4% improvement in communication time, and at the application level we achieve up to 8% performance improvement.
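To make the idea of topology-aware process-to-core mapping concrete, the following is a minimal sketch and not the MAGC algorithm itself. It assumes a hypothetical 4-process communication matrix `comm` and a hypothetical core-distance matrix `dist` (two nodes with two cores each, numbered round-robin), scores a mapping as the sum of volume times distance over all process pairs, and finds the cheapest mapping by exhaustive search. The paper instead uses Scotch and a congestion-aware heuristic, and also handles GPU-to-process assignment, none of which is modeled here.

```python
from itertools import permutations

# Hypothetical communication volumes between 4 processes (arbitrary units).
# Processes 0 and 1 talk heavily, as do processes 2 and 3.
comm = [
    [0, 10, 1, 1],
    [10, 0, 1, 1],
    [1, 1, 0, 10],
    [1, 1, 10, 0],
]

# Hypothetical distances between 4 cores (assumption: cost 1 within a node,
# cost 5 across nodes). Cores 0 and 2 share one node, cores 1 and 3 the other.
dist = [
    [0, 5, 1, 5],
    [5, 0, 5, 1],
    [1, 5, 0, 5],
    [5, 1, 5, 0],
]


def mapping_cost(mapping, comm, dist):
    """Sum of volume * distance over all process pairs i < j.

    mapping[i] is the core that process i is placed on.
    """
    n = len(mapping)
    return sum(
        comm[i][j] * dist[mapping[i]][mapping[j]]
        for i in range(n)
        for j in range(i + 1, n)
    )


def best_mapping(comm, dist):
    """Exhaustive search over all placements; only feasible for tiny inputs."""
    n = len(comm)
    return min(permutations(range(n)), key=lambda m: mapping_cost(m, comm, dist))


identity = tuple(range(4))  # default placement: process i on core i
best = best_mapping(comm, dist)

print(mapping_cost(identity, comm, dist))  # 112
print(best, mapping_cost(best, comm, dist))  # (0, 2, 1, 3) 40
```

In this toy case the default placement puts each heavily communicating pair on different nodes, while the searched placement co-locates them, which is the kind of gain a topology-aware mapper aims for. Exhaustive search grows factorially with the number of processes, which is why practical tools rely on graph-mapping libraries or heuristics.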

