A Flexible and Portable Large-Scale DGEMM Library for Linpack on Next-Generation Multi-GPU Systems

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. Pub Date: 2015-03-04. DOI: 10.1109/PDP.2015.89

D. Rohr, V. Lindenstruth

Citations: 10

Abstract

In recent years, high performance computing has benefitted greatly from special accelerator cards such as GPUs. Matrix multiplication, performed by the BLAS function DGEMM, is one of the prime examples where such accelerators excel. DGEMM is the computational hotspot of many tasks, among them the Linpack benchmark. Current GPUs achieve more than 1 TFLOPS of real performance in this task. Because GPUs are connected via PCI Express, multiple GPUs can easily be installed in a single compute node. This enables the construction of multi-TFLOPS systems out of off-the-shelf components. At such high performance, it is often complicated to feed the GPUs with sufficient data to run at full performance. In this paper we first analyze the scalability of our DGEMM implementation for multiple fast GPUs. Then we suggest a new scheme optimized for this situation and we present an implementation.
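As a reading aid, the operation the abstract refers to can be sketched in a few lines. DGEMM computes C ← αAB + βC in double precision, and the output matrix can be cut into independent tiles. The sketch below is a plain-Python illustration only, not the authors' library or their scheduling scheme: the function names, the tile sizes, and the assumption that each tile moves its A panel, its B panel, and its C tile exactly once are all hypothetical choices made for the example.

```python
def dgemm_reference(alpha, A, B, beta, C):
    """DGEMM semantics on lists of lists: return alpha * A * B + beta * C."""
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]


def dgemm_tiled(alpha, A, B, beta, C, tile_m=2, tile_n=2):
    """Same result, computed one output tile at a time.

    Each (i0, j0) tile depends only on a row panel of A and a column
    panel of B, so tiles are independent units of work (hypothetically,
    one tile per device at a time).
    """
    m, k, n = len(A), len(B), len(B[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for i in range(i0, min(i0 + tile_m, m)):
                for j in range(j0, min(j0 + tile_n, n)):
                    acc = sum(A[i][p] * B[p][j] for p in range(k))
                    out[i][j] = alpha * acc + beta * C[i][j]
    return out


def tile_flops_per_byte(tile_m, tile_n, k):
    """Arithmetic intensity of one tile.

    Assumes 2*m*n*k floating-point operations and that the A panel,
    the B panel, and the C tile are each moved once as 8-byte doubles.
    """
    flops = 2.0 * tile_m * tile_n * k
    bytes_moved = 8.0 * (tile_m * k + k * tile_n + tile_m * tile_n)
    return flops / bytes_moved
```

Under these assumptions, `tile_flops_per_byte` grows with the tile dimensions, which is one way to see why data supply becomes harder as the compute rate rises: a faster device needs more floating-point work per transferred byte to stay busy over a link of fixed bandwidth.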



