CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, A. Awan, D. Panda
{"title":"基于CUDA内核的大规模GPU集群集体约简操作","authors":"Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, A. Awan, D. Panda","doi":"10.1109/CCGrid.2016.111","DOIUrl":null,"url":null,"abstract":"Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. Massive heterogeneous parallelism offered by these accelerators have led to GPU-Aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, these collectives, due to their compute requirements have been implemented on CPU (or Host) only. However, with the advent of GPU technologies it has become important for MPI libraries to provide better design for their GPU (or Device) based versions. In this paper, we tackle the above challenges and provide designs and implementations for most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - for GPU clusters. We propose extensions to the state-of-the-art algorithms to fully take advantage of the GPU capabilities like GPUDirect RDMA (GDR) and CUDA compute kernel to efficiently perform these operations. With our new designs, we report reduced execution time for all compute-based collectives up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of proposed designs for extremely large-scale GPU clusters.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"75 15","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters\",\"authors\":\"Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, A. Awan, D. Panda\",\"doi\":\"10.1109/CCGrid.2016.111\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. Massive heterogeneous parallelism offered by these accelerators have led to GPU-Aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, these collectives, due to their compute requirements have been implemented on CPU (or Host) only. However, with the advent of GPU technologies it has become important for MPI libraries to provide better design for their GPU (or Device) based versions. In this paper, we tackle the above challenges and provide designs and implementations for most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - for GPU clusters. We propose extensions to the state-of-the-art algorithms to fully take advantage of the GPU capabilities like GPUDirect RDMA (GDR) and CUDA compute kernel to efficiently perform these operations. With our new designs, we report reduced execution time for all compute-based collectives up to 96 GPUs. 
Experimental results show an improvement of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of proposed designs for extremely large-scale GPU clusters.\",\"PeriodicalId\":103641,\"journal\":{\"name\":\"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)\",\"volume\":\"75 15\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGrid.2016.111\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid.2016.111","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 16

Abstract

Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. The massive heterogeneous parallelism offered by these accelerators has led to GPU-aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, these collectives, due to their compute requirements, have been implemented on the CPU (or host) only. However, with the advent of GPU technologies, it has become important for MPI libraries to provide better designs for their GPU (or device) based versions. In this paper, we tackle the above challenges and provide designs and implementations for the most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - for GPU clusters. We propose extensions to the state-of-the-art algorithms to fully take advantage of GPU capabilities such as GPUDirect RDMA (GDR) and CUDA compute kernels to efficiently perform these operations. With our new designs, we report reduced execution times for all compute-oriented collectives on up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than a 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of the proposed designs for extremely large-scale GPU clusters.
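The abstract describes compute-oriented collectives operating directly on GPU data. The sketch below is illustrative only and does not reproduce the paper's internal designs: it shows the usage pattern that a GPU-aware MPI library (for example, MVAPICH2-GDR) enables, where MPI_Reduce is called on device pointers and the library is free to perform the reduction with CUDA kernels and GPUDirect RDMA internally. The fill kernel, buffer size, and data values are assumptions added for demonstration.

```cuda
// Illustrative sketch: invoking MPI_Reduce on GPU device buffers with a
// GPU-aware MPI library. The library's internal use of CUDA kernels and
// GPUDirect RDMA (the subject of the paper) is hidden behind this call.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel that initializes each rank's contribution on the device.
__global__ void fill(float *buf, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = value;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;  // 1M floats per rank (assumed message size)
    float *d_send = nullptr, *d_recv = nullptr;
    cudaMalloc(&d_send, n * sizeof(float));
    cudaMalloc(&d_recv, n * sizeof(float));

    // Each rank contributes its rank ID so the reduced result is easy to check.
    fill<<<(n + 255) / 256, 256>>>(d_send, n, (float)rank);
    cudaDeviceSynchronize();

    // With a GPU-aware MPI build, device pointers can be passed directly;
    // a non-GPU-aware build would require staging through host memory first.
    MPI_Reduce(d_send, d_recv, n, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        float h0;
        cudaMemcpy(&h0, d_recv, sizeof(float), cudaMemcpyDeviceToHost);
        printf("reduced[0] = %f\n", h0);  // expect the sum of all rank IDs
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

Running this sketch assumes an MPI installation built with CUDA support; otherwise passing device pointers to MPI_Reduce is not valid.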