使用先进的矢量扩展AVX-512的MPI减少

Dong Zhong, Qinglei Cao, G. Bosilca, J. Dongarra
{"title":"使用先进的矢量扩展AVX-512的MPI减少","authors":"Dong Zhong, Qinglei Cao, G. Bosilca, J. Dongarra","doi":"10.1145/3416315.3416316","DOIUrl":null,"url":null,"abstract":"As the scale of high-performance computing (HPC) systems continues to grow, researchers are devoted themselves to explore increasing levels of parallelism to achieve optimal performance. The modern CPU’s design, including its features of hierarchical memory and SIMD/vectorization capability, governs algorithms’ efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) motivated vectorization to become of critical importance to increase efficiency and close the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operation and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefit the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments are conducted on an Intel Xeon Gold cluster, which shows our AVX-512 optimized reduction operations achieve 10X performance benefits than Open MPI default for MPI local reduction.","PeriodicalId":176723,"journal":{"name":"Proceedings of the 27th European MPI Users' Group Meeting","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Using Advanced Vector Extensions AVX-512 for MPI Reductions\",\"authors\":\"Dong Zhong, Qinglei Cao, G. Bosilca, J. Dongarra\",\"doi\":\"10.1145/3416315.3416316\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As the scale of high-performance computing (HPC) systems continues to grow, researchers are devoted themselves to explore increasing levels of parallelism to achieve optimal performance. The modern CPU’s design, including its features of hierarchical memory and SIMD/vectorization capability, governs algorithms’ efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) motivated vectorization to become of critical importance to increase efficiency and close the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operation and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefit the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments are conducted on an Intel Xeon Gold cluster, which shows our AVX-512 optimized reduction operations achieve 10X performance benefits than Open MPI default for MPI local reduction.\",\"PeriodicalId\":176723,\"journal\":{\"name\":\"Proceedings of the 27th European MPI Users' Group Meeting\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 27th European MPI Users' Group Meeting\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3416315.3416316\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th European MPI Users' Group Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3416315.3416316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

摘要

随着高性能计算(HPC)系统规模的不断增长,研究人员致力于探索提高并行性水平以实现最佳性能。现代CPU的设计,包括其分层存储和SIMD/矢量化能力的特点,决定了算法的效率。最近引入的宽向量指令集扩展(AVX和SVE)促使向量化成为提高效率和缩小与峰值性能差距的关键。在本文中,我们提出了一种预定义的MPI约简操作的实现,利用AVX, AVX2和AVX-512本征来提供基于向量的约简操作,并改善这些预定义的MPI约简操作的求解时间。通过这些优化,我们实现了更高的本地计算效率,这直接有利于集体减少的总体成本。对不同场景下的软件栈进行了评估,结果表明该方案具有通用性和高效性。在Intel至强Gold集群上进行的实验表明,AVX-512优化的缩减操作比Open MPI默认的MPI本地缩减实现了10倍的性能优势。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Using Advanced Vector Extensions AVX-512 for MPI Reductions
As the scale of high-performance computing (HPC) systems continues to grow, researchers are devoted themselves to explore increasing levels of parallelism to achieve optimal performance. The modern CPU’s design, including its features of hierarchical memory and SIMD/vectorization capability, governs algorithms’ efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) motivated vectorization to become of critical importance to increase efficiency and close the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operation and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefit the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments are conducted on an Intel Xeon Gold cluster, which shows our AVX-512 optimized reduction operations achieve 10X performance benefits than Open MPI default for MPI local reduction.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信