Optimizing Stencil Codes with Exploiting Data Reuse

2021 International Conference on Information Control, Electrical Engineering and Rail Transit (ICEERT)
Pub Date: 2021-10-01
DOI: 10.1109/iceert53919.2021.00018

Xun Chang, Li Shen, Qiong Wang




Optimizing Stencil Codes with Exploiting Data Reuse

Stencil codes are widely used in scientific computing. Current research on optimizing stencil applications focuses on data-level or thread-level parallelism. Vector/SIMD instructions, the usual way to achieve data-level parallelism, can effectively speed up computations that consist of many repetitive operations, but the gains are often limited by memory bandwidth or by data and control dependencies. The Scalable Vector Extension (SVE), the new generation of ARM's vector ISA, is Vector-Length Agnostic (VLA): code does not depend on the vector register length, which makes vectorization more flexible. SVE succeeds the older Neon SIMD technology. In this paper we design ARM SVE implementations of the 2d5p, 2d9p, 3d7p, and 3d27p stencils, which are among the most common types, and optimize them with classical strategies such as loop unrolling and data reuse. Experiments on ARM processors with vector lengths from 128 to 2048 bits show speedups of up to 2.88x over directly vectorized code, 8.91x over Neon code, and 16.31x over scalar code. In addition, we provide a set of templates that can be flexibly reconfigured when the stencil changes and that help generate efficient ARM SVE instructions directly. This should make it considerably easier to optimize other stencil codes.
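To make the idea of data reuse concrete, the following is a minimal scalar sketch of a 2d5p (two-dimensional, five-point) stencil sweep. It is not the authors' SVE implementation, which the abstract does not describe in detail. The function names, the averaging coefficient `coeff`, and the choice to copy boundary values unchanged are all assumptions made for illustration. The first function re-reads every input at every point; the second carries the west and center values across iterations of the inner loop, so only one new value from the current row is loaded per point.

```python
def stencil_2d5p_naive(grid, coeff=0.2):
    """Reference 2d5p sweep: each point re-reads all five inputs.

    Assumes `grid` is a rectangular list of lists, at least 3 x 3.
    Boundary values are copied unchanged (an illustrative choice).
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = coeff * (grid[i][j] + grid[i][j - 1] + grid[i][j + 1]
                                 + grid[i - 1][j] + grid[i + 1][j])
    return out


def stencil_2d5p_reuse(grid, coeff=0.2):
    """Same sweep with data reuse along the row.

    The west and center values are kept in local variables and rotated
    after each point, so only the east value is newly read from the
    current row in each iteration.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        north, mid, south = grid[i - 1], grid[i], grid[i + 1]
        west, center = mid[0], mid[1]
        for j in range(1, cols - 1):
            east = mid[j + 1]  # the only new read from the current row
            out[i][j] = coeff * (center + west + east + north[j] + south[j])
            west, center = center, east  # rotate values for the next point
    return out


if __name__ == "__main__":
    demo = [[float(i + j) for j in range(5)] for i in range(4)]
    # Both variants sum the inputs in the same order, so results match exactly.
    print(stencil_2d5p_reuse(demo) == stencil_2d5p_naive(demo))
```

In a vectorized version the same rotation would presumably apply to whole vector registers rather than scalars, but how the paper maps this onto SVE registers is not stated in the abstract.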
