Title: Investigating automatic vectorization for real-time 3D scene understanding
Authors: A. Nica, E. Vespa, Pablo González de Aledo Marugán, P. Kelly
DOI: 10.1145/3178433.3178438
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Simultaneous Localization And Mapping (SLAM) is the problem of building a representation of a geometric space while simultaneously estimating the observer's location within it. Although this appears to be a chicken-and-egg problem, several algorithms have emerged over the last decades that solve it approximately and iteratively. SLAM algorithms are tailored to the available resources, balancing the precision of the map against the constraints imposed by the computational platform and the need for real-time results. Working with KinectFusion, an established SLAM implementation, we explore the vectorization opportunities present in this scenario, with the goal of using the CPU to its full potential. Using ISPC, an automatic vectorization tool, we produce a partially vectorized version of KinectFusion. Along the way we explore a number of optimization strategies, among them tiling to exploit ray coherence and outer-loop vectorization, obtaining up to a 4x speed-up over the baseline on an 8-wide vector machine.

Title: A Data Layout Transformation for Vectorizing Compilers
Authors: Arsène Pérard-Gayot, Richard Membarth, P. Slusallek, Simon Moll, Roland Leißa, Sebastian Hack
DOI: 10.1145/3178433.3178440
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Modern processors are often equipped with vector instruction sets. Such instructions operate on multiple elements of data at once and greatly improve performance for specific applications. A programmer has two options to take advantage of these instructions: writing manually vectorized code, or using an auto-vectorizing compiler. In the latter case, they need only place annotations instructing the compiler to vectorize a particular piece of code. Thanks to auto-vectorization, the source program remains portable, and the programmer can focus on the task at hand instead of the low-level details of intrinsics programming. However, the performance of the vectorized program strongly depends on the precision of the analyses performed by the vectorizing compiler. In this paper, we improve the precision of these analyses by selectively splitting stack-allocated variables of a structure or aggregate type. Without this optimization, automatic vectorization slows execution down compared to the scalar, non-vectorized code; with it enabled, we show that the vectorized code can be as fast as hand-optimized, manually vectorized implementations.

{"title":"Small SIMD Matrices for CERN High Throughput Computing","authors":"F. Lemaitre, Benjamin Couturier, L. Lacassagne","doi":"10.1145/3178433.3178434","DOIUrl":"https://doi.org/10.1145/3178433.3178434","url":null,"abstract":"System tracking is an old problem and has been heavily optimized throughout the past. However, in High Energy Physics, many small systems are tracked in real-time using Kalman filtering and no implementation satisfying those constraints currently exists. In this paper, we present a code generator used to speed up Cholesky Factorization and Kalman Filter for small matrices. The generator is easy to use and produces portable and heavily optimized code. We focus on current SIMD architectures (SSE, AVX, AVX512, Neon, SVE, Altivec and VSX). Our Cholesky factorization outperforms any existing libraries: from x3 to x10 faster than MKL. The Kalman Filter is also faster than existing implementations, and achieves 4 · 109 iter/s on a 2x24C Intel Xeon.","PeriodicalId":197479,"journal":{"name":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","volume":"354 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132451570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: MIPP: a Portable C++ SIMD Wrapper and its use for Error Correction Coding in 5G Standard
Authors: Adrien Cassagne, Olivier Aumage, Denis Barthou, Camille Leroux, C. Jégo
DOI: 10.1145/3178433.3178435
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Error correction code (ECC) processing has so far been performed on dedicated hardware for previous generations of mobile communication standards, in order to meet latency and bandwidth constraints. As the 5G mobile standard and its associated channel coding algorithms are now being specified, modern CPUs are progressing to the point where software channel decoders can viably be contemplated. A key aspect in reaching this transition point is getting the most out of the CPU's SIMD units on the decoding algorithms being considered for the 5G standard. The nature and diversity of these algorithms require highly versatile programming tools. This paper demonstrates the virtues and versatility of our MIPP SIMD wrapper in implementing a high-performance portfolio of key ECC decoding algorithms.

Title: SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors
Authors: Christopher I. Rodrigues, Amarin Phaosawasdi, Peng Wu
DOI: 10.1145/3178433.3178436
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests where the loops are shorter than the processor's SIMD width, automatic vectorization performs poorly. Vectorizers attempt to vectorize a single short loop, using (at best) a fraction of the processor's SIMD capacity. It is not straightforward to vectorize multiple nested loops together because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize. We present a solution in the context of compiling small tensor multiplication. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed into permute instructions. We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 13.5% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.55x as fast on average as that produced by GCC's state-of-the-art vectorizer, with a maximum speedup of 10x. We discuss potential extensions to vectorize more general algorithms.

{"title":"Vectorization of a spectral finite-element numerical kernel","authors":"S. Jubertie, F. Dupros, F. D. Martin","doi":"10.1145/3178433.3178441","DOIUrl":"https://doi.org/10.1145/3178433.3178441","url":null,"abstract":"In this paper, we present an optimized implementation of the Finite-Element Methods numerical kernel for SIMD vectorization. A typical application is the modelling of seismic wave propagation. In this case, the computations at the element level are generally based on nested loops where the memory accesses are non-contiguous. Moreover, the back and forth from the element level to the global level (e.g., assembly phase) is a serious brake for automatic vectorization by compilers and for efficient reuse of data at the cache memory levels. This is particularly true when the problem under study relies on an unstructured mesh. The application proxies used for our experiments were extracted from EFISPEC code that implements the spectral finite-element method to solve the elastodynamic equations. We underline that the intra-node performance may be further improved. Additionally, we show that standard compilers such as GNU GCC, Clang and Intel ICC are unable to perform automatic vectorization even when the nested loops were reorganized or when SIMD pragmas were added. Due to the irregular memory access pattern, we introduce a dedicated strategy to squeeze the maximum performance out of the SIMD units. Experiments are carried out on Intel Broadwell and Skylake platforms that respectively offer AVX2 and AVX-512 SIMD units. We believe that our vectorization approach may be generic enough to be adapted to other codes.","PeriodicalId":197479,"journal":{"name":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123651208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Usuba: Optimizing & Trustworthy Bitslicing Compiler
Authors: Darius Mercadier, Pierre-Évariste Dagand, L. Lacassagne, Gilles Muller
DOI: 10.1145/3178433.3178437
In: Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018
Abstract: Bitslicing is a programming technique, commonly used in cryptography, that consists in implementing a combinational circuit in software. It results in a massively parallel program that is immune to cache-timing attacks by design. However, writing a program in bitsliced form requires extreme care. This paper introduces Usuba, a synchronous dataflow language producing bitsliced C code. Usuba is both a domain-specific language, providing syntactic support for the implementation of cryptographic algorithms, and a domain-specific compiler, taking advantage of well-defined semantic invariants to perform various optimizations before handing the generated code to an (optimizing) C compiler. On the Data Encryption Standard (DES) algorithm, we show that Usuba outperforms a reference, hand-tuned implementation by 15% (using Intel's 64-bit general-purpose registers, and depending on the underlying C compiler), while our implementation also transparently supports modern SIMD extensions (SSE, AVX, AVX-512), other architectures (ARM Neon, IBM Altivec) and multicore processors through an OpenMP backend.

{"title":"Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout","authors":"M. Springer, H. Masuhara","doi":"10.1145/3178433.3178439","DOIUrl":"https://doi.org/10.1145/3178433.3178439","url":null,"abstract":"Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We present Ikra-Cpp, an embedded DSL for object-oriented programming in C++/CUDA. Ikra-Cpp's notation is very close to standard AOS-style C++ code, but data is layed out as SOA. This gives programmers the performance benefit of SOA and the expressiveness of AOS-style object-oriented programming at the same time. Ikra-Cpp is well integrated with C++ and lets programmers use C++ notation and syntax for classes, fields, member functions, constructors and instance creation.","PeriodicalId":197479,"journal":{"name":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126026262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}