Redesigning Peridigm on SIMT Accelerators for High-performance Peridynamics Simulations

Xinyuan Li, Huang Ye, Jian Zhang
{"title":"基于SIMT加速器的周期重新设计用于高性能周期动力学模拟","authors":"Xinyuan Li, Huang Ye, Jian Zhang","doi":"10.1109/IPDPS49936.2021.00052","DOIUrl":null,"url":null,"abstract":"Peridigm is one of the most frequently utilized Peridynamics (PD) simulation software for problems involving discontinuity, such as cracks and fragmentation. However, performing long-term and large-scale simulations is very time-consuming for Peridigm. To enhance the performance and scalability of Peridigm, we port and optimize Peridigm on the SIMT accelerators. Challenges are imposed on efficient Peridigm on the SIMT architecture by the complex calculations and massive memory access of PD simulations. In this study, a series of strategies and techniques are proposed to optimize the performance of Peridigm. We first adjust the algorithms of bond-based calculations to eliminate the data conflicts with minimized overhead in order to achieve parallel Peridigm on accelerators. Furthermore, we propose thread grouping and collaborative memory access strategies to decrease the overhead of data fetch from device memory. To improve the efficiency of calculations, we also refine the calculation instructions. Finally, we offer a transmission-computation overlapping strategy for reducing the overhead brought by the data transmissions and improving the scalability. The optimized Peridigm on 4 Nvidia Tesla V100 GPUs accelerates the basic parallel Peridigm on 4 V100 GPUs 10.24 times. Compared to the original Peridigm run on 8 Intel Xeon Gold 6248 CPUs (160 cores, 320 threads) and the optimized PD application run on 4 SW26010 processors (1,040 cores), our work on 4 V100 GPUs accelerates the simulation 9 times and 4 times respectively. As for large-scale simulations, because we don’t have enough V100 GPUs, we run our work on noncommercial SIMT accelerators which have similar performance to the V100 of the PCIe version, with the example scales from 282,000 points to 36,096,000 points and the number of accelerators scales from 4 to 512, near-linear scalability is observed and the performance ultimately reaching 825.72 TFLOPS with 98.81% parallel efficiency","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"162 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Redesigning Peridigm on SIMT Accelerators for High-performance Peridynamics Simulations\",\"authors\":\"Xinyuan Li, Huang Ye, Jian Zhang\",\"doi\":\"10.1109/IPDPS49936.2021.00052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Peridigm is one of the most frequently utilized Peridynamics (PD) simulation software for problems involving discontinuity, such as cracks and fragmentation. However, performing long-term and large-scale simulations is very time-consuming for Peridigm. To enhance the performance and scalability of Peridigm, we port and optimize Peridigm on the SIMT accelerators. Challenges are imposed on efficient Peridigm on the SIMT architecture by the complex calculations and massive memory access of PD simulations. In this study, a series of strategies and techniques are proposed to optimize the performance of Peridigm. We first adjust the algorithms of bond-based calculations to eliminate the data conflicts with minimized overhead in order to achieve parallel Peridigm on accelerators. 
Furthermore, we propose thread grouping and collaborative memory access strategies to decrease the overhead of data fetch from device memory. To improve the efficiency of calculations, we also refine the calculation instructions. Finally, we offer a transmission-computation overlapping strategy for reducing the overhead brought by the data transmissions and improving the scalability. The optimized Peridigm on 4 Nvidia Tesla V100 GPUs accelerates the basic parallel Peridigm on 4 V100 GPUs 10.24 times. Compared to the original Peridigm run on 8 Intel Xeon Gold 6248 CPUs (160 cores, 320 threads) and the optimized PD application run on 4 SW26010 processors (1,040 cores), our work on 4 V100 GPUs accelerates the simulation 9 times and 4 times respectively. As for large-scale simulations, because we don’t have enough V100 GPUs, we run our work on noncommercial SIMT accelerators which have similar performance to the V100 of the PCIe version, with the example scales from 282,000 points to 36,096,000 points and the number of accelerators scales from 4 to 512, near-linear scalability is observed and the performance ultimately reaching 825.72 TFLOPS with 98.81% parallel efficiency\",\"PeriodicalId\":372234,\"journal\":{\"name\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"162 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS49936.2021.00052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Peridigm is one of the most widely used Peridynamics (PD) simulation packages for problems involving discontinuities, such as cracks and fragmentation. However, long-term and large-scale simulations are very time-consuming in Peridigm. To enhance its performance and scalability, we port and optimize Peridigm on SIMT accelerators. The complex calculations and massive memory accesses of PD simulations make an efficient Peridigm on the SIMT architecture challenging. In this study, a series of strategies and techniques is proposed to optimize the performance of Peridigm. We first adjust the bond-based calculation algorithms to eliminate data conflicts with minimal overhead, enabling parallel Peridigm on accelerators. Furthermore, we propose thread grouping and collaborative memory access strategies to reduce the cost of fetching data from device memory. To improve computational efficiency, we also refine the calculation instructions. Finally, we introduce a transmission-computation overlapping strategy that reduces the overhead of data transfers and improves scalability. The optimized Peridigm on 4 Nvidia Tesla V100 GPUs is 10.24 times faster than the basic parallel port on the same 4 V100 GPUs. Compared with the original Peridigm on 8 Intel Xeon Gold 6248 CPUs (160 cores, 320 threads) and an optimized PD application on 4 SW26010 processors (1,040 cores), our work on 4 V100 GPUs accelerates the simulation 9 times and 4 times, respectively. For large-scale simulations, because we do not have enough V100 GPUs, we run our code on noncommercial SIMT accelerators whose performance is similar to that of the PCIe version of the V100. With the problem size scaled from 282,000 to 36,096,000 points and the number of accelerators scaled from 4 to 512, near-linear scalability is observed, and the performance ultimately reaches 825.72 TFLOPS with 98.81% parallel efficiency.
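
The abstract states that the bond-based calculations are adjusted to eliminate data conflicts. One standard way to do this on SIMT hardware is a point-parallel ("gather") formulation: each thread owns one material point, loops over that point's own neighbor list, and accumulates the force privately, so each bond is evaluated from both endpoints instead of being scattered to two points with atomics. The following is a minimal sketch of that idea under those assumptions; the data layout (CSR neighbor list) and all identifiers are illustrative and are not Peridigm's actual kernel or data structures.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Point-parallel bond force kernel for a bond-based PD material.
// One thread per point; no cross-thread writes, hence no atomics.
__global__ void bond_force_gather(int num_points,
                                  const double* __restrict__ x,       // reference coords, 3*n
                                  const double* __restrict__ u,       // displacements,    3*n
                                  const double* __restrict__ volume,  // cell volumes,       n
                                  const int*    __restrict__ neighbor_offsets, // CSR offsets, n+1
                                  const int*    __restrict__ neighbor_list,    // CSR columns
                                  double bond_constant,
                                  double* __restrict__ force)         // force density out, 3*n
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_points) return;

    double fx = 0.0, fy = 0.0, fz = 0.0;
    const double xi0 = x[3*i], xi1 = x[3*i+1], xi2 = x[3*i+2];
    const double ui0 = u[3*i], ui1 = u[3*i+1], ui2 = u[3*i+2];

    for (int k = neighbor_offsets[i]; k < neighbor_offsets[i+1]; ++k) {
        int j = neighbor_list[k];
        // Bond vectors in the reference (X) and deformed (Y) configurations.
        double X0 = x[3*j]   - xi0, X1 = x[3*j+1] - xi1, X2 = x[3*j+2] - xi2;
        double Y0 = X0 + u[3*j]   - ui0;
        double Y1 = X1 + u[3*j+1] - ui1;
        double Y2 = X2 + u[3*j+2] - ui2;
        double r0 = sqrt(X0*X0 + X1*X1 + X2*X2);   // initial bond length
        double r  = sqrt(Y0*Y0 + Y1*Y1 + Y2*Y2);   // current bond length
        double stretch = (r - r0) / r0;
        double scale   = bond_constant * stretch * volume[j] / r;
        fx += scale * Y0;
        fy += scale * Y1;
        fz += scale * Y2;
    }
    // Exactly one thread writes each point, so plain stores are conflict-free.
    force[3*i]   = fx;
    force[3*i+1] = fy;
    force[3*i+2] = fz;
}
```

The trade-off in such a gather scheme is that every bond is computed twice (once from each endpoint); the paper's claim of "minimized overhead" suggests the extra arithmetic is cheaper on SIMT hardware than serializing conflicting writes.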
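
The transmission-computation overlapping strategy mentioned in the abstract follows a general GPU pattern: update points with no remote dependencies while the ghost-layer data exchanged between ranks is still in flight, and update the boundary points only after the transfer completes. The toy program below sketches that pattern with two CUDA streams and a stand-in update kernel; the interior/boundary split, buffer names, and sizes are assumptions for illustration, not the paper's implementation.

```cuda
#include <cuda_runtime.h>

// Stand-in for the real PD update, restricted to points [first, first + count).
__global__ void update_range(int first, int count, double* u, const double* f, double dt)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < first + count) u[i] += dt * f[i];
}

int main()
{
    const int n_interior = 1 << 20;          // points with no remote dependencies
    const int n_ghost    = 1 << 14;          // ghost-layer values received from neighbor ranks
    const int n_total    = n_interior + n_ghost;
    const int block      = 256;
    const size_t ghost_bytes = n_ghost * sizeof(double);

    double *d_u, *d_f, *h_ghost;
    cudaMalloc((void**)&d_u, n_total * sizeof(double));
    cudaMalloc((void**)&d_f, n_total * sizeof(double));
    cudaMallocHost((void**)&h_ghost, ghost_bytes);        // pinned, needed for a truly async copy
    cudaMemset(d_u, 0, n_total * sizeof(double));
    cudaMemset(d_f, 0, n_total * sizeof(double));
    for (int i = 0; i < n_ghost; ++i) h_ghost[i] = 1.0;   // stand-in for MPI halo-exchange results

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // 1. Update interior points immediately; they do not touch ghost data.
    update_range<<<(n_interior + block - 1) / block, block, 0, compute>>>(
        0, n_interior, d_u, d_f, 1.0e-7);

    // 2. Concurrently stream the exchanged ghost values into the tail of d_f.
    cudaMemcpyAsync(d_f + n_interior, h_ghost, ghost_bytes,
                    cudaMemcpyHostToDevice, copy);

    // 3. Boundary points need the ghost data: wait for the copy, then enqueue
    //    their update on the compute stream (it runs after the interior kernel).
    cudaStreamSynchronize(copy);
    update_range<<<(n_ghost + block - 1) / block, block, 0, compute>>>(
        n_interior, n_ghost, d_u, d_f, 1.0e-7);

    cudaDeviceSynchronize();
    cudaFree(d_u); cudaFree(d_f); cudaFreeHost(h_ghost);
    cudaStreamDestroy(compute); cudaStreamDestroy(copy);
    return 0;
}
```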