Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing: Latest Publications

Investigating automatic vectorization for real-time 3D scene understanding
A. Nica, E. Vespa, Pablo González de Aledo Marugán, P. Kelly
DOI: https://doi.org/10.1145/3178433.3178438 | Published: 2018-02-24
Abstract: Simultaneous Localization And Mapping (SLAM) is the problem of building a representation of a geometric space while simultaneously estimating the observer's location within the space. While this seems to be a chicken-and-egg problem, several algorithms have appeared in the last decades that approximately and iteratively solve this problem. SLAM algorithms are tailored to the available resources, hence aimed at balancing the precision of the map with the constraints that the computational platform imposes and the desire to obtain real-time results. Working with KinectFusion, an established SLAM implementation, we explore in this work the vectorization opportunities present in this scenario, with the goal of using the CPU to its full potential. Using ISPC, an automatic vectorization tool, we produce a partially vectorized version of KinectFusion. Along the way we explore a number of optimization strategies, among which tiling to exploit ray-coherence and outer loop vectorization, obtaining up to 4x speed-up over the baseline on an 8-wide vector machine.
Citations: 0
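
The abstract highlights two strategies, tiling to exploit ray coherence and outer-loop vectorization. The C++ sketch below is not the authors' ISPC code; march_ray and raycast_tiled are invented stand-ins that only show the general shape of vectorizing across adjacent rays of a tile rather than inside a single ray.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-ray kernel standing in for KinectFusion's TSDF raycast;
// the real kernel marches through a volume and is far richer than this.
inline float march_ray(float ox, float oy, float oz) {
    float t = 0.0f;
    for (int step = 0; step < 64; ++step)
        t += 0.01f * std::sqrt(ox * ox + oy * oy + oz * oz);
    return t;
}

// Outer-loop vectorization over rays: adjacent rays of a small tile become
// the SIMD lanes, which is also where ray coherence helps.
void raycast_tiled(std::vector<float>& depth, int width, int height, int tile) {
    for (int ty = 0; ty < height; ty += tile)
        for (int tx = 0; tx < width; tx += tile) {
            const int ye = std::min(ty + tile, height);
            const int xe = std::min(tx + tile, width);
            for (int y = ty; y < ye; ++y) {
                #pragma omp simd               // one ray per vector lane
                for (int x = tx; x < xe; ++x)
                    depth[std::size_t(y) * width + x] =
                        march_ray(float(x), float(y), 1.0f);
            }
        }
}
```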

A Data Layout Transformation for Vectorizing Compilers
Arsène Pérard-Gayot, Richard Membarth, P. Slusallek, Simon Moll, Roland Leißa, Sebastian Hack
DOI: https://doi.org/10.1145/3178433.3178440 | Published: 2018-02-24
Abstract: Modern processors are often equipped with vector instruction sets. Such instructions operate on multiple elements of data at once, and greatly improve performance for specific applications. A programmer has two options to take advantage of these instructions: writing manually vectorized code, or using an auto-vectorizing compiler. In the latter case, he only has to place annotations to instruct the auto-vectorizing compiler to vectorize a particular piece of code. Thanks to auto-vectorization, the source program remains portable, and the programmer can focus on the task at hand instead of the low-level details of intrinsics programming. However, the performance of the vectorized program strongly depends on the precision of the analyses performed by the vectorizing compiler. In this paper, we improve the precision of these analyses by selectively splitting stack-allocated variables of a structure or aggregate type. Without this optimization, automatic vectorization slows the execution down compared to the scalar, non-vectorized code. When this optimization is enabled, we show that the vectorized code can be as fast as hand-optimized, manually vectorized implementations.
Citations: 2
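
The optimization described above splits stack-allocated aggregate variables so that the vectorizer's analyses see independent scalars instead of opaque memory. The hand-written C++ sketch below illustrates the before/after effect on a toy kernel; it is not the authors' compiler pass, and Vec3, scale_aos and scale_split are invented names.

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// Before the transformation: a stack-allocated aggregate inside the loop.
// Some vectorizers treat accesses to 'tmp' conservatively, as memory,
// which can block or degrade vectorization.
void scale_aos(const Vec3* in, Vec3* out, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i) {
        Vec3 tmp = in[i];
        tmp.x *= s; tmp.y *= s; tmp.z *= s;
        out[i] = tmp;
    }
}

// After (what selective splitting achieves, written by hand here): each
// field lives in its own scalar, so per-lane values map cleanly onto
// vector registers during vectorization.
void scale_split(const Vec3* in, Vec3* out, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i) {
        float x = in[i].x, y = in[i].y, z = in[i].z;
        x *= s; y *= s; z *= s;
        out[i].x = x; out[i].y = y; out[i].z = z;
    }
}
```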

Small SIMD Matrices for CERN High Throughput Computing
F. Lemaitre, Benjamin Couturier, L. Lacassagne
DOI: https://doi.org/10.1145/3178433.3178434 | Published: 2018-02-24
Abstract: System tracking is an old problem and has been heavily optimized throughout the past. However, in High Energy Physics, many small systems are tracked in real-time using Kalman filtering and no implementation satisfying those constraints currently exists. In this paper, we present a code generator used to speed up Cholesky Factorization and Kalman Filter for small matrices. The generator is easy to use and produces portable and heavily optimized code. We focus on current SIMD architectures (SSE, AVX, AVX512, Neon, SVE, Altivec and VSX). Our Cholesky factorization outperforms any existing libraries: from x3 to x10 faster than MKL. The Kalman Filter is also faster than existing implementations, and achieves 4·10^9 iter/s on a 2x24C Intel Xeon.
Citations: 4
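
A common way for code generators to exploit SIMD on many small matrices is to factorize one matrix per vector lane over a structure-of-arrays batch; the sketch below assumes that layout and is a generic illustration, not the authors' generated code (cholesky_batch is an invented name).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Batched, in-place Cholesky factorization of 'count' small N x N SPD
// matrices. Element (i,j) of matrix m lives at a[(i*N + j)*count + m]
// (structure of arrays), so the loop over matrices maps onto SIMD lanes.
template <int N>
void cholesky_batch(std::vector<float>& a, std::size_t count) {
    for (int j = 0; j < N; ++j)
        for (int i = j; i < N; ++i) {
            #pragma omp simd               // one matrix per vector lane
            for (std::size_t m = 0; m < count; ++m) {
                float sum = a[std::size_t(i * N + j) * count + m];
                for (int k = 0; k < j; ++k)
                    sum -= a[std::size_t(i * N + k) * count + m] *
                           a[std::size_t(j * N + k) * count + m];
                a[std::size_t(i * N + j) * count + m] =
                    (i == j) ? std::sqrt(sum)
                             : sum / a[std::size_t(j * N + j) * count + m];
            }
        }
}
// Example: cholesky_batch<5>(a, count) expects a.size() == 25 * count.
```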

MIPP: a Portable C++ SIMD Wrapper and its use for Error Correction Coding in 5G Standard
Adrien Cassagne, Olivier Aumage, Denis Barthou, Camille Leroux, C. Jégo
DOI: https://doi.org/10.1145/3178433.3178435 | Published: 2018-02-24
Abstract: Error correction code (ECC) processing has so far been performed on dedicated hardware for previous generations of mobile communication standards, to meet latency and bandwidth constraints. As the 5G mobile standard, and its associated channel coding algorithms, are now being specified, modern CPUs are progressing to the point where software channel decoders can viably be contemplated. A key aspect in reaching this transition point is to get the most of CPUs' SIMD units on the decoding algorithms being pondered for 5G mobile standards. The nature and diversity of such algorithms requires highly versatile programming tools. This paper demonstrates the virtues and versatility of our MIPP SIMD wrapper in implementing a high performance portfolio of key ECC decoding algorithms.
Citations: 14
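
To convey the programming model of a portable SIMD wrapper without guessing MIPP's exact API, the sketch below uses a deliberately tiny, scalar-backed Reg<T, N> stand-in (not MIPP's real type). The point is that kernel code written against the wrapper stays identical regardless of which vector width or instruction set the wrapper maps to underneath.

```cpp
#include <algorithm>
#include <cstddef>

// Toy wrapper: a real one would back 'v' with an intrinsic vector type
// (SSE/AVX/AVX-512/NEON/...) selected at compile time.
template <typename T, int N>
struct Reg {
    T v[N];
    static Reg load(const T* p)        { Reg r; for (int i = 0; i < N; ++i) r.v[i] = p[i]; return r; }
    void store(T* p) const             { for (int i = 0; i < N; ++i) p[i] = v[i]; }
    friend Reg operator+(Reg a, Reg b) { Reg r; for (int i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i]; return r; }
    friend Reg min(Reg a, Reg b)       { Reg r; for (int i = 0; i < N; ++i) r.v[i] = std::min(a.v[i], b.v[i]); return r; }
};

// A toy element-wise kernel written once against the wrapper; remainder
// handling for n not divisible by N is omitted for brevity.
template <int N>
void add_then_min(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i + N <= n; i += N) {
        auto ra = Reg<float, N>::load(a + i);
        auto rb = Reg<float, N>::load(b + i);
        min(ra + rb, rb).store(out + i);
    }
}
```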

SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors
Christopher I. Rodrigues, Amarin Phaosawasdi, Peng Wu
DOI: https://doi.org/10.1145/3178433.3178436 | Published: 2018-02-24
Abstract: Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests where the loops are shorter than the processor's SIMD width, automatic vectorization performs poorly. Vectorizers attempt to vectorize a single short loop, using (at best) a fraction of the processor's SIMD capacity. It is not straightforward to vectorize multiple nested loops together because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize. We present a solution in the context of compiling small tensor multiplication. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed into permute instructions. We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 13.5% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.55x as fast on average as that produced by GCC's state-of-the-art vectorizer, with a maximum speedup of 10x. We discuss potential extensions to vectorize more general algorithms.
Citations: 3
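
The core difficulty described above is that each loop of a small tensor multiplication is shorter than the SIMD width. One ingredient of the solution, collapsing several short loops into a single flat loop whose trip count fills a vector, can be shown by hand; the sketch below does only that (the extents NI, NJ, NK and the function name are invented), while the paper's compiler additionally tiles data into vector registers and turns strided accesses into permute instructions.

```cpp
constexpr int NI = 4, NJ = 4, NK = 2;   // each loop alone is shorter than a wide SIMD unit

// C[i][j] += A[i][k] * B[k][j], with the i and j loops collapsed into one
// flat loop of NI*NJ = 16 independent iterations for the vectorizer.
void contract_collapsed(const float A[NI][NK], const float B[NK][NJ], float C[NI][NJ]) {
    #pragma omp simd
    for (int ij = 0; ij < NI * NJ; ++ij) {
        const int i = ij / NJ, j = ij % NJ;
        float acc = C[i][j];
        for (int k = 0; k < NK; ++k)
            acc += A[i][k] * B[k][j];
        C[i][j] = acc;
    }
}
```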

Vectorization of a spectral finite-element numerical kernel
S. Jubertie, F. Dupros, F. D. Martin
DOI: https://doi.org/10.1145/3178433.3178441 | Published: 2018-02-24
Abstract: In this paper, we present an optimized implementation of the Finite-Element Methods numerical kernel for SIMD vectorization. A typical application is the modelling of seismic wave propagation. In this case, the computations at the element level are generally based on nested loops where the memory accesses are non-contiguous. Moreover, the back and forth from the element level to the global level (e.g., assembly phase) is a serious brake for automatic vectorization by compilers and for efficient reuse of data at the cache memory levels. This is particularly true when the problem under study relies on an unstructured mesh. The application proxies used for our experiments were extracted from EFISPEC code that implements the spectral finite-element method to solve the elastodynamic equations. We underline that the intra-node performance may be further improved. Additionally, we show that standard compilers such as GNU GCC, Clang and Intel ICC are unable to perform automatic vectorization even when the nested loops were reorganized or when SIMD pragmas were added. Due to the irregular memory access pattern, we introduce a dedicated strategy to squeeze the maximum performance out of the SIMD units. Experiments are carried out on Intel Broadwell and Skylake platforms that respectively offer AVX2 and AVX-512 SIMD units. We believe that our vectorization approach may be generic enough to be adapted to other codes.
Citations: 5
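
A generic way to cope with the indirect, non-contiguous accesses described above is to gather element data into a contiguous scratch buffer, run a vectorizable loop on that buffer, and scatter the results back. The sketch below shows only this pattern (element_kernel and its arguments are invented); it is not EFISPEC's actual kernel or the paper's dedicated strategy.

```cpp
#include <cstddef>
#include <vector>

void element_kernel(const std::vector<float>& node_val,
                    const std::vector<int>& elem_to_node,  // nodes_per_elem * num_elems entries
                    std::vector<float>& node_out,
                    int nodes_per_elem, std::size_t num_elems, float coeff) {
    std::vector<float> scratch(nodes_per_elem);
    for (std::size_t e = 0; e < num_elems; ++e) {
        const int* idx = &elem_to_node[e * nodes_per_elem];
        for (int a = 0; a < nodes_per_elem; ++a)    // gather (indirect accesses)
            scratch[a] = node_val[idx[a]];
        #pragma omp simd                            // contiguous, vectorizable work
        for (int a = 0; a < nodes_per_elem; ++a)
            scratch[a] *= coeff;
        for (int a = 0; a < nodes_per_elem; ++a)    // scatter-accumulate back
            node_out[idx[a]] += scratch[a];
    }
}
```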

Usuba: Optimizing & Trustworthy Bitslicing Compiler
Darius Mercadier, Pierre-Évariste Dagand, L. Lacassagne, Gilles Muller
DOI: https://doi.org/10.1145/3178433.3178437 | Published: 2018-02-24
Abstract: Bitslicing is a programming technique commonly used in cryptography that consists in implementing a combinational circuit in software. It results in a massively parallel program immune to cache-timing attacks by design. However, writing a program in bitsliced form requires extreme minutia. This paper introduces Usuba, a synchronous dataflow language producing bitsliced C code. Usuba is both a domain-specific language -- providing syntactic support for the implementation of cryptographic algorithms -- as well as a domain-specific compiler -- taking advantage of well-defined semantics invariants to perform various optimizations before handing the generated code to an (optimizing) C compiler. On the Data Encryption Standard (DES) algorithm, we show that Usuba outperforms a reference, hand-tuned implementation by 15% (using Intel's 64 bits general-purpose registers and depending on the underlying C compiler) whilst our implementation also transparently supports modern SIMD extensions (SSE, AVX, AVX-512), other architectures (ARM Neon, IBM Altivec) as well as multicore processors through an OpenMP backend.
Citations: 9
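
The bitslicing idea itself fits in a few lines of plain C++ (illustrative of the technique in general, not Usuba-generated code): bit k of each machine word belongs to the k-th of 64 independent cipher instances, so every bitwise instruction advances all 64 instances at once and there are no secret-dependent branches or table lookups.

```cpp
#include <cstdint>

// A 4-bit value in bitsliced form: one 64-bit word per bit position,
// carrying that bit for 64 independent instances.
struct Sliced4 { uint64_t b0, b1, b2, b3; };

// A toy 4-bit S-box expressed as boolean equations over the slices.
// The equations are arbitrary; a real cipher's S-box would be synthesized
// into a similar (larger) network of AND/OR/XOR/NOT gates.
inline Sliced4 toy_sbox(Sliced4 x) {
    Sliced4 y;
    y.b0 = x.b1 ^ (x.b2 & x.b3);
    y.b1 = x.b0 ^ x.b2;
    y.b2 = (x.b0 | x.b3) ^ x.b1;
    y.b3 = ~x.b0 & x.b2;
    return y;
}
```

The same source scales to wider registers by replacing uint64_t with a 128-, 256- or 512-bit vector type, which is how bitsliced code benefits from SSE, AVX or AVX-512.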

Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout
M. Springer, H. Masuhara
DOI: https://doi.org/10.1145/3178433.3178439 | Published: 2018-02-24
Abstract: Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We present Ikra-Cpp, an embedded DSL for object-oriented programming in C++/CUDA. Ikra-Cpp's notation is very close to standard AOS-style C++ code, but data is laid out as SOA. This gives programmers the performance benefit of SOA and the expressiveness of AOS-style object-oriented programming at the same time. Ikra-Cpp is well integrated with C++ and lets programmers use C++ notation and syntax for classes, fields, member functions, constructors and instance creation.
Citations: 7
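
A hand-written illustration of the SOA-behind-objects idea follows; it is not Ikra-Cpp's actual API (which provides its own class and field templates), and BodyStorage, BodyRef and step are invented names.

```cpp
#include <cstddef>
#include <vector>

struct BodyStorage {                    // SOA: one contiguous array per field
    std::vector<float> pos_x, vel_x;
    explicit BodyStorage(std::size_t n) : pos_x(n, 0.0f), vel_x(n, 0.0f) {}
};

struct BodyRef {                        // AOS-style view of a single element
    BodyStorage& s; std::size_t i;
    float& pos_x() { return s.pos_x[i]; }
    float& vel_x() { return s.vel_x[i]; }
};

// Hot loops touch a single field array with unit stride, which is what
// makes the SOA layout SIMD-friendly.
inline void step(BodyStorage& s, float dt) {
    const std::size_t n = s.pos_x.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        s.pos_x[i] += s.vel_x[i] * dt;
}
// Object-style access still reads naturally: BodyRef{bodies, 42}.vel_x() = 1.0f;
```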