Analyzing GPU Tensor Core Potential for Fast Reductions

2018 37th International Conference of the Chilean Computer Science Society (SCCC). Pub Date: 2018-10-07. DOI: 10.29007/zlmg

R. Carrasco, R. Vega, C. Navarro

Citations: 11

Abstract

The Nvidia GPU architecture has introduced new computing elements such as tensor cores, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose, namely the parallel arithmetic reduction problem. We propose a new GPU tensor-core based algorithm and analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of n numbers as a set of m × m MMA tensor-core operations (for Nvidia's Volta architecture, m = 16) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When the cost is analyzed under a simplified GPU computing model, the new algorithm reduces a problem of n numbers in $T(n) = 5\log_{m^2}(n)$ steps, with a speedup of $S = \frac{4}{5}\log_2(m^2)$.
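The encoding described in the abstract can be illustrated with a small CPU-side simulation. The sketch below is not the authors' implementation: it assumes one common way to express a sum as MMA operations, multiplying an m × m tile by an all-ones matrix from the left and then from the right, so that every entry of the result holds the tile's total. The function names (`mma`, `reduce_tile`, `tensor_reduce`, `model_steps`, `model_speedup`) are hypothetical, and the constants 5 and 4/5 are taken directly from the abstract's cost model rather than derived here.

```python
import math

def mma(a, b, c):
    """Return D = A x B + C for m x m lists of lists.

    Plain-Python stand-in for a single tensor-core MMA operation.
    """
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)] for i in range(m)]

def reduce_tile(values, m):
    """Sum up to m*m numbers using two chained MMAs with an all-ones matrix."""
    padded = list(values) + [0] * (m * m - len(values))  # zero-pad a short tile
    tile = [padded[r * m:(r + 1) * m] for r in range(m)]
    ones = [[1] * m for _ in range(m)]
    zeros = [[0] * m for _ in range(m)]
    col_sums = mma(ones, tile, zeros)   # every row now holds the column sums
    total = mma(col_sums, ones, zeros)  # every entry now holds the full sum
    return total[0][0]

def tensor_reduce(values, m=16):
    """Reduce level by level; each level shrinks the input by a factor of m*m.

    Returns (sum, number_of_levels).
    """
    data = list(values)
    levels = 0
    while len(data) > 1:
        data = [reduce_tile(data[i:i + m * m], m)
                for i in range(0, len(data), m * m)]
        levels += 1
    return (data[0] if data else 0), levels

def model_steps(n, m=16):
    """T(n) = 5 * log_{m^2}(n), as stated in the abstract."""
    return 5 * math.log(n, m * m)

def model_speedup(m=16):
    """S = (4/5) * log_2(m^2), as stated in the abstract."""
    return 4 / 5 * math.log2(m * m)
```

For Volta's m = 16, each level collapses 256 values into one, so the number of levels grows as $\log_{256}(n)$, and the stated speedup evaluates to $S = \frac{4}{5}\log_2(256) = 6.4$. The simulation uses exact integer arithmetic and ignores the reduced-precision inputs that real tensor cores operate on, so it says nothing about numerical error.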

