Vectorization of a spectral finite-element numerical kernel

Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing. Pub Date: 2018-02-24. DOI: 10.1145/3178433.3178441

S. Jubertie, F. Dupros, F. D. Martin


Citations: 5

Abstract

In this paper, we present an implementation of the finite-element method numerical kernel optimized for SIMD vectorization. A typical application is the modelling of seismic wave propagation. In this case, the computations at the element level are generally based on nested loops in which the memory accesses are non-contiguous. Moreover, the back-and-forth between the element level and the global level (e.g., the assembly phase) is a serious obstacle to automatic vectorization by compilers and to efficient reuse of data at the cache memory levels. This is particularly true when the problem under study relies on an unstructured mesh. The application proxies used for our experiments were extracted from the EFISPEC code, which implements the spectral finite-element method to solve the elastodynamic equations. We underline that the intra-node performance may be further improved. Additionally, we show that standard compilers such as GNU GCC, Clang and Intel ICC are unable to perform automatic vectorization even when the nested loops are reorganized or when SIMD pragmas are added. Because of the irregular memory access pattern, we introduce a dedicated strategy to squeeze the maximum performance out of the SIMD units. Experiments are carried out on Intel Broadwell and Skylake platforms, which offer AVX2 and AVX-512 SIMD units, respectively. We believe that our vectorization approach may be generic enough to be adapted to other codes.
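The access pattern the abstract describes can be illustrated with a minimal C++ sketch of one gather / compute / scatter sweep over an unstructured mesh. This is not the EFISPEC kernel or the authors' vectorization strategy, neither of which is given here. The names `Connectivity`, `assemble`, `kNodesPerElement` and `weights` are hypothetical, and the element-level arithmetic is a placeholder for the real operator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical size, chosen for illustration only.
constexpr std::size_t kNodesPerElement = 4;

// For each element, the global indices of its nodes (unstructured mesh).
using Connectivity = std::vector<std::array<std::size_t, kNodesPerElement>>;

// One gather / compute / scatter sweep over all elements.
// `weights` stands in for the real element-level operator (an assumption).
void assemble(const Connectivity& conn,
              const std::array<double, kNodesPerElement>& weights,
              const std::vector<double>& displacement,
              std::vector<double>& force) {
    assert(displacement.size() == force.size());
    for (const auto& nodes : conn) {
        std::array<double, kNodesPerElement> local{};
        // Gather: indirect, non-contiguous reads from the global array.
        for (std::size_t i = 0; i < kNodesPerElement; ++i) {
            assert(nodes[i] < displacement.size());
            local[i] = displacement[nodes[i]];
        }
        // Compute: contiguous element-level work (placeholder arithmetic).
        for (std::size_t i = 0; i < kNodesPerElement; ++i) {
            local[i] *= weights[i];
        }
        // Scatter-add: indirect writes. Nodes shared between elements
        // create dependencies across iterations of the outer loop.
        for (std::size_t i = 0; i < kNodesPerElement; ++i) {
            force[nodes[i]] += local[i];
        }
    }
}
```

The gather and scatter loops index the global arrays through the connectivity table, so consecutive iterations touch unrelated addresses, and elements that share nodes accumulate into the same entries of `force`. Both properties are of the kind that makes it hard for a compiler to prove a loop safe and profitable to vectorize.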

