{"title":"Autotuning Tensor Transposition","authors":"Lai Wei, J. Mellor-Crummey","doi":"10.1109/IPDPSW.2014.43","DOIUrl":null,"url":null,"abstract":"Tensor transposition, a generalization of matrix transposition, is an important primitive used when performing tensor contraction. Efficient implementation of tensor transposition for modern node architectures depends on various architecture capabilities such as cache and memory hierarchy, threads, and SIMD parallelism. This paper introduces a framework that uses static analysis and empirical autotuning to produce optimized parallel tensor transposition code for node architectures using a rule-based code generation and transformation system. By exploring various optimization techniques with different settings, our framework achieves more than 80% of the bandwidth of memcpy for tensors on two very different node architectures, one a dual-socket system with Intel Westmere processors and the other a quad-socket system with IBM Power7 processors.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2014.43","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9
Abstract
Tensor transposition, a generalization of matrix transposition, is an important primitive used when performing tensor contraction. Efficient implementation of tensor transposition on modern node architectures depends on exploiting architectural features such as the cache and memory hierarchy, multithreading, and SIMD parallelism. This paper introduces a framework that uses static analysis and empirical autotuning, built on a rule-based code generation and transformation system, to produce optimized parallel tensor transposition code for node architectures. By exploring various optimization techniques with different settings, our framework achieves more than 80% of the bandwidth of memcpy for tensors on two very different node architectures: a dual-socket system with Intel Westmere processors and a quad-socket system with IBM Power7 processors.
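To make the primitive concrete, the sketch below shows an unoptimized reference kernel for dense tensor transposition in C: given a row-major tensor and a mode permutation, it writes each output element by gathering from the permuted input offset. This is only an illustrative baseline under the stated storage assumptions; it is not the optimized, autotuned code generated by the paper's framework, and the function and variable names (transpose_tensor, dims, perm) are hypothetical.

/* Minimal sketch of dense tensor transposition (mode permutation), assuming
 * row-major storage. Illustrative reference only, not the paper's generated code.
 * B[j0,...,j_{n-1}] = A[..] where output mode d carries the index of input mode perm[d]. */
#include <stdio.h>
#include <stdlib.h>

static void transpose_tensor(const double *A, double *B,
                             const int *dims, const int *perm, int n)
{
    /* Total number of elements. */
    long total = 1;
    for (int d = 0; d < n; d++) total *= dims[d];

    /* Row-major strides of the input tensor. */
    long *stride_in = malloc(n * sizeof(long));
    stride_in[n - 1] = 1;
    for (int d = n - 2; d >= 0; d--)
        stride_in[d] = stride_in[d + 1] * dims[d + 1];

    /* Dimensions and row-major strides of the output tensor. */
    int  *dims_out   = malloc(n * sizeof(int));
    long *stride_out = malloc(n * sizeof(long));
    for (int d = 0; d < n; d++) dims_out[d] = dims[perm[d]];
    stride_out[n - 1] = 1;
    for (int d = n - 2; d >= 0; d--)
        stride_out[d] = stride_out[d + 1] * dims_out[d + 1];

    /* Walk the output in linear order and gather from the permuted input offset. */
    for (long lin = 0; lin < total; lin++) {
        long rem = lin, off_in = 0;
        for (int d = 0; d < n; d++) {
            long idx = rem / stride_out[d];      /* index along output mode d */
            rem %= stride_out[d];
            off_in += idx * stride_in[perm[d]];  /* same index along input mode perm[d] */
        }
        B[lin] = A[off_in];
    }

    free(stride_in);
    free(dims_out);
    free(stride_out);
}

int main(void)
{
    /* A 2 x 3 x 4 tensor transposed with permutation (2, 0, 1) becomes 4 x 2 x 3. */
    int dims[3] = {2, 3, 4};
    int perm[3] = {2, 0, 1};
    double A[24], B[24];
    for (int i = 0; i < 24; i++) A[i] = (double)i;

    transpose_tensor(A, B, dims, perm, 3);

    /* Element A[1][2][3] should land at B[3][1][2]. */
    printf("A[1][2][3] = %g, B[3][1][2] = %g\n",
           A[1*12 + 2*4 + 3], B[3*6 + 1*3 + 2]);
    return 0;
}

Because the innermost input accesses of such a naive loop are generally strided, its memory bandwidth falls well short of memcpy; closing that gap through blocking for the cache hierarchy, threading, and SIMD, with parameters chosen by empirical autotuning, is precisely the problem the paper's framework addresses.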