Optimizing Stencil Codes with Exploiting Data Reuse

Xun Chang, Li Shen, Qiong Wang
2021 International Conference on Information Control, Electrical Engineering and Rail Transit (ICEERT), published 2021-10-01
DOI: 10.1109/iceert53919.2021.00018 (https://doi.org/10.1109/iceert53919.2021.00018)

Abstract

Stencil codes are widely used in scientific computing. Current research focuses on optimizing stencil applications through data-level or thread-level parallelism. Vector/SIMD instructions, the usual route to data-level parallelism, can substantially speed up computations that consist of many repetitive operations, but the gains are often limited by memory bandwidth or by data and control dependencies. The Scalable Vector Extension (SVE), the new generation of ARM's vector ISA, is Vector-Length Agnostic (VLA): it makes vectorization more flexible by not fixing the vector register length in the code, and it succeeds the older Neon SIMD technology. In this paper we use ARM SVE instructions to implement and optimize 2d5p, 2d9p, 3d7p, and 3d27p stencil codes, which are among the most common stencil types, applying classical optimization strategies such as loop unrolling and data reuse. Experiments on ARM processors with vector lengths from 128 bits to 2048 bits show speedups of up to 2.88x over directly vectorized code, 8.91x over Neon, and 16.31x over scalar code. In addition, we provide a set of templates that can be reconfigured when the stencil changes, which helps generate efficient ARM SVE instructions directly. This should make it considerably easier to optimize other stencil codes.
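To make the terms concrete, here is a minimal scalar sketch of a 2d5p (two-dimensional, five-point) stencil sweep, written once naively and once with data reuse along the row. It is not the authors' SVE implementation, which the abstract does not show. The function names, the Jacobi-style update, and the coefficients `c_center` and `c_neighbor` are assumptions chosen for illustration only.

```python
def stencil_2d5p_naive(grid, c_center=0.5, c_neighbor=0.125):
    """One sweep of a 5-point stencil over the interior of `grid`.

    Every output point reloads all five inputs from the grid.
    Coefficients are illustrative assumptions, not values from the paper.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (c_center * grid[i][j]
                         + c_neighbor * (grid[i][j - 1] + grid[i][j + 1]
                                         + grid[i - 1][j] + grid[i + 1][j]))
    return out


def stencil_2d5p_reuse(grid, c_center=0.5, c_neighbor=0.125):
    """Same sweep with data reuse along the row.

    The west and center values are carried over from the previous
    iteration, so each element of the current row is loaded only once.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    if cols < 3:
        return out  # no interior points to update
    for i in range(1, rows - 1):
        north, cur, south = grid[i - 1], grid[i], grid[i + 1]
        west, center = cur[0], cur[1]
        for j in range(1, cols - 1):
            east = cur[j + 1]  # the only new load from the current row
            out[i][j] = (c_center * center
                         + c_neighbor * (west + east + north[j] + south[j]))
            west, center = center, east  # slide the window one step right
    return out
```

Both functions add the neighbors in the same order, so they produce identical results. In a vectorized version the same idea applies to whole vector registers rather than single values, which is where reducing loads matters most given the memory-bandwidth limits the abstract mentions.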