StVEC: A Vector Instruction Extension for High Performance Stencil Computation

2011 International Conference on Parallel Architectures and Compilation Techniques. Pub Date: 2011-10-10. DOI: 10.1109/PACT.2011.59

N. Sedaghati, Renji Thomas, L. Pouchet, R. Teodorescu, P. Sadayappan


Citations: 15

Abstract

Stencil computations comprise the compute-intensive core of many scientific applications. The data access pattern of stencil computations often requires several adjacent data elements of arrays to be accessed in innermost parallel loops. Although such loops are vectorized by current compilers like GCC and ICC that target short-vector SIMD instruction sets, a number of redundant loads or additional intra-register data shuffle operations are required, reducing the achievable performance. Thus, even when all arrays are cache resident, the peak performance achieved with stencil computations is considerably lower than machine peak. In this paper, we present a hardware-based solution for this problem. We propose an extension to the standard addressing mode of vector floating-point instructions in ISAs such as SSE, AVX, and VMX: an extended mode of paired-register addressing, together with its hardware implementation, to overcome the performance limitation of current short-vector SIMD ISAs for stencil computations. Further, we present a code generation approach that can be used by a vectorizing compiler for processors with such an instruction set. Using an optimistic as well as a pessimistic emulation of the proposed instruction extension, we demonstrate the effectiveness of the proposed approach on top of SSE- and AVX-capable processors. We also synthesize parts of the proposed design using a 45 nm CMOS library and show minimal impact on processor cycle time.
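
To make the access pattern concrete, the following C sketch emulates a 4-lane vector register in plain scalar code and applies a 3-point stencil. The `vec4` type, the helper `pair_window`, and all function names are illustrative assumptions, not the instruction encoding proposed in the paper. The sketch models only one reading of paired-register addressing, in which an operand is a window of consecutive lanes taken across two adjacent registers, so that each input block is loaded once and the shifted neighbor vectors need neither extra loads nor a separately written shuffle step.

```c
#include <stddef.h>

#define VLEN 4  /* lanes per emulated vector register (assumed, as in SSE single precision) */

/* Emulated vector register; a stand-in for a hardware SIMD register. */
typedef struct { float lane[VLEN]; } vec4;

/* Emulates one vector load of VLEN consecutive floats. */
static vec4 load_block(const float *p) {
    vec4 v;
    for (int i = 0; i < VLEN; ++i) v.lane[i] = p[i];
    return v;
}

/* Hypothetical paired-register operand: returns lanes shift..shift+VLEN-1
   of the concatenation lo:hi, with 0 <= shift <= VLEN. */
static vec4 pair_window(vec4 lo, vec4 hi, int shift) {
    vec4 r;
    for (int i = 0; i < VLEN; ++i) {
        int k = i + shift;
        r.lane[i] = (k < VLEN) ? lo.lane[k] : hi.lane[k - VLEN];
    }
    return r;
}

/* Lane-wise addition. */
static vec4 vadd(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < VLEN; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Scalar reference: out[i] = in[i-1] + in[i] + in[i+1] for interior i. */
void stencil3_scalar(const float *in, float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}

/* Blocked version: each input block is loaded exactly once; the left and
   right neighbor vectors are windows over (prev, cur) and (cur, next). */
void stencil3_paired(const float *in, float *out, size_t n) {
    size_t b = VLEN;  /* start of the first block that has a full block on each side */
    if (n >= 3 * VLEN) {
        vec4 prev = load_block(in);
        vec4 cur  = load_block(in + VLEN);
        for (; b + 2 * VLEN <= n; b += VLEN) {
            vec4 next  = load_block(in + b + VLEN);
            vec4 left  = pair_window(prev, cur, VLEN - 1);  /* in[b-1 .. b+2] */
            vec4 right = pair_window(cur, next, 1);         /* in[b+1 .. b+4] */
            vec4 sum   = vadd(vadd(left, cur), right);
            for (int i = 0; i < VLEN; ++i) out[b + i] = sum.lane[i];
            prev = cur;
            cur  = next;
        }
    }
    /* Scalar head: interior elements of the first block. */
    for (size_t i = 1; i < VLEN && i + 1 < n; ++i)
        out[i] = in[i - 1] + in[i] + in[i + 1];
    /* Scalar tail: everything after the last full vector block. */
    for (size_t i = b; i + 1 < n; ++i)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}
```

Both functions write only the interior elements `out[1]` through `out[n-2]` and add the three terms in the same order, so their results should agree exactly. On existing SSE or AVX hardware, the two `pair_window` calls would instead correspond to the redundant unaligned loads or shuffle operations that the abstract identifies as the source of lost performance.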
