{"title":"MAGC: GPU集群的映射方法","authors":"S. Mirsadeghi, Iman Faraji, A. Afsahi","doi":"10.1109/SBAC-PAD.2016.15","DOIUrl":null,"url":null,"abstract":"GPU accelerators have been increasingly used in modern heterogeneous HPC clusters by offering high performance and energy efficiency. Such heterogeneous GPU clusters consisting of multiple CPU cores and GPU devices have become the platform of choice for many HPC applications. The communication channels among these processing elements expose different latency and bandwidth characteristics. Thus, efficient utilization of communication channels becomes an important factor for achieving higher inter-process communication performance. In this paper, we exploit topology awareness for a better utilization of communication channels in GPU clusters. We first discuss the challenges associated with topology-aware mapping in GPU clusters, and then propose MAGC, a Mapping Approach for GPU Clusters. MAGC seeks to improve the total communication performance by a joint consideration of both CPU-to-CPU and GPU-to-GPU communications of the application, and CPU and GPU physical topologies of the underlying GPU cluster. It provides a unified framework for topology-aware process-to-core mapping and GPU-to-process assignment across a GPU cluster. We study the potential benefits of MAGC with two different mapping algorithms: a) the Scotch graph mapping library, and b) a heuristic designed to explicitly consider maximum congestion. We evaluate our design through extensive experiments at micro-benchmark and application levels on two GPU clusters with different GPU types and topologies. We have developed a micro-benchmark suite to model various communication patterns among CPU cores and among GPU devices. For application results, we use the molecular dynamics simulator, HOOMD-blue. Micro-benchmark results show that we can achieve up to 91.4% improvement in communication time. At the application level, we can achieve up to 8% performance improvement.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"272 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"MAGC: A Mapping Approach for GPU Clusters\",\"authors\":\"S. Mirsadeghi, Iman Faraji, A. Afsahi\",\"doi\":\"10.1109/SBAC-PAD.2016.15\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"GPU accelerators have been increasingly used in modern heterogeneous HPC clusters by offering high performance and energy efficiency. Such heterogeneous GPU clusters consisting of multiple CPU cores and GPU devices have become the platform of choice for many HPC applications. The communication channels among these processing elements expose different latency and bandwidth characteristics. Thus, efficient utilization of communication channels becomes an important factor for achieving higher inter-process communication performance. In this paper, we exploit topology awareness for a better utilization of communication channels in GPU clusters. We first discuss the challenges associated with topology-aware mapping in GPU clusters, and then propose MAGC, a Mapping Approach for GPU Clusters. MAGC seeks to improve the total communication performance by a joint consideration of both CPU-to-CPU and GPU-to-GPU communications of the application, and CPU and GPU physical topologies of the underlying GPU cluster. It provides a unified framework for topology-aware process-to-core mapping and GPU-to-process assignment across a GPU cluster. We study the potential benefits of MAGC with two different mapping algorithms: a) the Scotch graph mapping library, and b) a heuristic designed to explicitly consider maximum congestion. We evaluate our design through extensive experiments at micro-benchmark and application levels on two GPU clusters with different GPU types and topologies. We have developed a micro-benchmark suite to model various communication patterns among CPU cores and among GPU devices. For application results, we use the molecular dynamics simulator, HOOMD-blue. Micro-benchmark results show that we can achieve up to 91.4% improvement in communication time. At the application level, we can achieve up to 8% performance improvement.\",\"PeriodicalId\":361160,\"journal\":{\"name\":\"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)\",\"volume\":\"272 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SBAC-PAD.2016.15\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SBAC-PAD.2016.15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
GPU accelerators have been increasingly used in modern heterogeneous HPC clusters by offering high performance and energy efficiency. Such heterogeneous GPU clusters consisting of multiple CPU cores and GPU devices have become the platform of choice for many HPC applications. The communication channels among these processing elements expose different latency and bandwidth characteristics. Thus, efficient utilization of communication channels becomes an important factor for achieving higher inter-process communication performance. In this paper, we exploit topology awareness for a better utilization of communication channels in GPU clusters. We first discuss the challenges associated with topology-aware mapping in GPU clusters, and then propose MAGC, a Mapping Approach for GPU Clusters. MAGC seeks to improve the total communication performance by a joint consideration of both CPU-to-CPU and GPU-to-GPU communications of the application, and CPU and GPU physical topologies of the underlying GPU cluster. It provides a unified framework for topology-aware process-to-core mapping and GPU-to-process assignment across a GPU cluster. We study the potential benefits of MAGC with two different mapping algorithms: a) the Scotch graph mapping library, and b) a heuristic designed to explicitly consider maximum congestion. We evaluate our design through extensive experiments at micro-benchmark and application levels on two GPU clusters with different GPU types and topologies. We have developed a micro-benchmark suite to model various communication patterns among CPU cores and among GPU devices. For application results, we use the molecular dynamics simulator, HOOMD-blue. Micro-benchmark results show that we can achieve up to 91.4% improvement in communication time. At the application level, we can achieve up to 8% performance improvement.