Redesigning Peridigm on SIMT Accelerators for High-performance Peridynamics Simulations
Authors: Xinyuan Li, Huang Ye, Jian Zhang
Venue: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), published 2021-05-01
DOI: 10.1109/IPDPS49936.2021.00052 (https://doi.org/10.1109/IPDPS49936.2021.00052)
Peridigm is one of the most widely used Peridynamics (PD) simulation packages for problems involving discontinuities, such as cracks and fragmentation. However, long-duration, large-scale simulations with Peridigm are very time-consuming. To improve its performance and scalability, we port Peridigm to SIMT accelerators and optimize it there. The complex calculations and heavy memory traffic of PD simulations make an efficient SIMT implementation challenging. In this study, we propose a series of strategies and techniques to address these challenges. We first adjust the bond-based calculation algorithms to eliminate data conflicts with minimal overhead, which makes it possible to run Peridigm in parallel on accelerators. We then propose thread grouping and collaborative memory access strategies to reduce the overhead of fetching data from device memory. To improve computational efficiency, we also refine the calculation instructions. Finally, we introduce a strategy that overlaps data transmission with computation, reducing the overhead of data transfers and improving scalability. On 4 Nvidia Tesla V100 GPUs, the optimized Peridigm is 10.24 times faster than the basic parallel Peridigm on the same 4 GPUs. Compared with the original Peridigm running on 8 Intel Xeon Gold 6248 CPUs (160 cores, 320 threads) and with the optimized PD application running on 4 SW26010 processors (1,040 cores), our work on 4 V100 GPUs accelerates the simulation 9 times and 4 times, respectively. For large-scale simulations, because we do not have enough V100 GPUs, we run our work on noncommercial SIMT accelerators whose performance is similar to that of the PCIe version of the V100. As the example scales from 282,000 points to 36,096,000 points and the number of accelerators scales from 4 to 512, we observe near-linear scalability, with performance ultimately reaching 825.72 TFLOPS at 98.81% parallel efficiency.
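The scaling figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that "parallel efficiency" means the per-accelerator throughput at 512 accelerators divided by the per-accelerator throughput at 4 accelerators, a definition the abstract does not state. The function names and the implied 4-accelerator baseline are hypothetical; the baseline is derived from that assumption and is not a figure reported by the authors.

```python
# Figures quoted in the abstract.
BASE_ACCELERATORS = 4
BASE_POINTS = 282_000
MAX_ACCELERATORS = 512
MAX_POINTS = 36_096_000
PEAK_TFLOPS = 825.72
PARALLEL_EFFICIENCY = 0.9881


def points_per_accelerator(points: int, accelerators: int) -> float:
    """Average number of material points handled by one accelerator."""
    return points / accelerators


def weak_scaling_efficiency(base_rate: float, base_n: int,
                            scaled_rate: float, scaled_n: int) -> float:
    """Assumed definition (not stated in the abstract):
    per-accelerator rate at scale / per-accelerator rate at the baseline."""
    return (scaled_rate / scaled_n) / (base_rate / base_n)


base_load = points_per_accelerator(BASE_POINTS, BASE_ACCELERATORS)
max_load = points_per_accelerator(MAX_POINTS, MAX_ACCELERATORS)
scale_factor = MAX_ACCELERATORS // BASE_ACCELERATORS

# Baseline throughput implied by the assumed definition above.
# This is a derived illustration, not a number reported in the abstract.
implied_base_tflops = PEAK_TFLOPS / (scale_factor * PARALLEL_EFFICIENCY)

print(f"points per accelerator at {BASE_ACCELERATORS}: {base_load:.0f}")
print(f"points per accelerator at {MAX_ACCELERATORS}: {max_load:.0f}")
print(f"accelerator count grows by a factor of {scale_factor}")
print(f"implied baseline (assumption): {implied_base_tflops:.2f} TFLOPS")
```

Both ends of the run work out to 70,500 points per accelerator, so the problem size grows in step with the accelerator count. That is the pattern of a weak-scaling experiment, which is consistent with reporting efficiency relative to the 4-accelerator case.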