Title: Investigating automatic vectorization for real-time 3D scene understanding
Authors: A. Nica, E. Vespa, Pablo González de Aledo Marugán, P. Kelly
DOI: 10.1145/3178433.3178438
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Simultaneous Localization And Mapping (SLAM) is the problem of building a representation of a geometric space while simultaneously estimating the observer's location within it. Although this appears to be a chicken-and-egg problem, several algorithms have emerged over the last decades that solve it approximately and iteratively. SLAM algorithms are tailored to the available resources, balancing the precision of the map against the constraints imposed by the computational platform and the need for real-time results. Working with KinectFusion, an established SLAM implementation, we explore the vectorization opportunities present in this scenario, with the goal of using the CPU to its full potential. Using ISPC, an automatic vectorization tool, we produce a partially vectorized version of KinectFusion. Along the way we explore a number of optimization strategies, among them tiling to exploit ray coherence and outer-loop vectorization, obtaining up to a 4x speed-up over the baseline on an 8-wide vector machine.

Title: A Data Layout Transformation for Vectorizing Compilers
Authors: Arsène Pérard-Gayot, Richard Membarth, P. Slusallek, Simon Moll, Roland Leißa, Sebastian Hack
DOI: 10.1145/3178433.3178440
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Modern processors are often equipped with vector instruction sets. Such instructions operate on multiple elements of data at once and greatly improve performance for specific applications. A programmer has two options to take advantage of these instructions: writing manually vectorized code, or using an auto-vectorizing compiler. In the latter case, they need only place annotations instructing the compiler to vectorize a particular piece of code. Thanks to auto-vectorization, the source program remains portable, and the programmer can focus on the task at hand instead of the low-level details of intrinsics programming. However, the performance of the vectorized program strongly depends on the precision of the analyses performed by the vectorizing compiler. In this paper, we improve the precision of these analyses by selectively splitting stack-allocated variables of a structure or aggregate type. Without this optimization, automatic vectorization slows execution down compared to the scalar, non-vectorized code; with it enabled, we show that the vectorized code can be as fast as hand-optimized, manually vectorized implementations.

{"title":"Small SIMD Matrices for CERN High Throughput Computing","authors":"F. Lemaitre, Benjamin Couturier, L. Lacassagne","doi":"10.1145/3178433.3178434","DOIUrl":"https://doi.org/10.1145/3178433.3178434","url":null,"abstract":"System tracking is an old problem and has been heavily optimized throughout the past. However, in High Energy Physics, many small systems are tracked in real-time using Kalman filtering and no implementation satisfying those constraints currently exists. In this paper, we present a code generator used to speed up Cholesky Factorization and Kalman Filter for small matrices. The generator is easy to use and produces portable and heavily optimized code. We focus on current SIMD architectures (SSE, AVX, AVX512, Neon, SVE, Altivec and VSX). Our Cholesky factorization outperforms any existing libraries: from x3 to x10 faster than MKL. The Kalman Filter is also faster than existing implementations, and achieves 4 · 109 iter/s on a 2x24C Intel Xeon.","PeriodicalId":197479,"journal":{"name":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","volume":"354 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132451570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: MIPP: a Portable C++ SIMD Wrapper and its use for Error Correction Coding in 5G Standard
Authors: Adrien Cassagne, Olivier Aumage, Denis Barthou, Camille Leroux, C. Jégo
DOI: 10.1145/3178433.3178435
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Error correction code (ECC) processing has so far been performed on dedicated hardware for previous generations of mobile communication standards, in order to meet latency and bandwidth constraints. As the 5G mobile standard and its associated channel coding algorithms are now being specified, modern CPUs are progressing to the point where software channel decoders can viably be contemplated. A key aspect in reaching this transition point is getting the most out of the CPU's SIMD units on the decoding algorithms being considered for the 5G standard. The nature and diversity of these algorithms require highly versatile programming tools. This paper demonstrates the virtues and versatility of our MIPP SIMD wrapper in implementing a high-performance portfolio of key ECC decoding algorithms.

Title: SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors
Authors: Christopher I. Rodrigues, Amarin Phaosawasdi, Peng Wu
DOI: 10.1145/3178433.3178436
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests where the loops are shorter than the processor's SIMD width, automatic vectorization performs poorly. Vectorizers attempt to vectorize a single short loop, using (at best) a fraction of the processor's SIMD capacity. It is not straightforward to vectorize multiple nested loops together because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize. We present a solution in the context of compiling small tensor multiplication. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed into permute instructions. We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 13.5% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.55x as fast on average as that produced by GCC's state-of-the-art vectorizer, with a maximum speedup of 10x. We discuss potential extensions to vectorize more general algorithms.

{"title":"Vectorization of a spectral finite-element numerical kernel","authors":"S. Jubertie, F. Dupros, F. D. Martin","doi":"10.1145/3178433.3178441","DOIUrl":"https://doi.org/10.1145/3178433.3178441","url":null,"abstract":"In this paper, we present an optimized implementation of the Finite-Element Methods numerical kernel for SIMD vectorization. A typical application is the modelling of seismic wave propagation. In this case, the computations at the element level are generally based on nested loops where the memory accesses are non-contiguous. Moreover, the back and forth from the element level to the global level (e.g., assembly phase) is a serious brake for automatic vectorization by compilers and for efficient reuse of data at the cache memory levels. This is particularly true when the problem under study relies on an unstructured mesh. The application proxies used for our experiments were extracted from EFISPEC code that implements the spectral finite-element method to solve the elastodynamic equations. We underline that the intra-node performance may be further improved. Additionally, we show that standard compilers such as GNU GCC, Clang and Intel ICC are unable to perform automatic vectorization even when the nested loops were reorganized or when SIMD pragmas were added. Due to the irregular memory access pattern, we introduce a dedicated strategy to squeeze the maximum performance out of the SIMD units. Experiments are carried out on Intel Broadwell and Skylake platforms that respectively offer AVX2 and AVX-512 SIMD units. We believe that our vectorization approach may be generic enough to be adapted to other codes.","PeriodicalId":197479,"journal":{"name":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123651208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Usuba: Optimizing & Trustworthy Bitslicing Compiler
Authors: Darius Mercadier, Pierre-Évariste Dagand, L. Lacassagne, Gilles Muller
DOI: 10.1145/3178433.3178437
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Bitslicing is a programming technique, commonly used in cryptography, that consists in implementing a combinational circuit in software. It results in a massively parallel program that is immune to cache-timing attacks by design. However, writing a program in bitsliced form requires extreme care. This paper introduces Usuba, a synchronous dataflow language producing bitsliced C code. Usuba is both a domain-specific language, providing syntactic support for the implementation of cryptographic algorithms, and a domain-specific compiler, taking advantage of well-defined semantic invariants to perform various optimizations before handing the generated code to an (optimizing) C compiler. On the Data Encryption Standard (DES) algorithm, we show that Usuba outperforms a reference, hand-tuned implementation by 15% (using Intel's 64-bit general-purpose registers, and depending on the underlying C compiler), while our implementation also transparently supports modern SIMD extensions (SSE, AVX, AVX-512), other architectures (ARM Neon, IBM Altivec) and multicore processors through an OpenMP backend.

{"title":"Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout","authors":"M. Springer, H. Masuhara","doi":"10.1145/3178433.3178439","DOIUrl":"https://doi.org/10.1145/3178433.3178439","url":null,"abstract":"Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We present Ikra-Cpp, an embedded DSL for object-oriented programming in C++/CUDA. Ikra-Cpp's notation is very close to standard AOS-style C++ code, but data is layed out as SOA. This gives programmers the performance benefit of SOA and the expressiveness of AOS-style object-oriented programming at the same time. Ikra-Cpp is well integrated with C++ and lets programmers use C++ notation and syntax for classes, fields, member functions, constructors and instance creation.","PeriodicalId":197479,"journal":{"name":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126026262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}