Sameer Kumar, Yogish Sabharwal, R. Garg, P. Heidelberger
2008 37th International Conference on Parallel Processing, published 2008-09-09. DOI: 10.1109/ICPP.2008.83 (https://doi.org/10.1109/ICPP.2008.83)
Citations: 70
Optimization of All-to-All Communication on the Blue Gene/L Supercomputer
All-to-all communication is a well-known performance bottleneck for many applications, such as those that use the fast Fourier transform (FFT) algorithm. We analyze the performance of all-to-all communication on the Blue Gene/L torus interconnect, which exhibits link contention even for all-to-all operations with short messages. We observe that the performance of all-to-all depends on the shape of the processor partition, and we present a performance analysis of all-to-all on partitions of various shapes. We then present optimization schemes that substantially improve the performance of all-to-all with both short and large messages. In particular, throughput improved from 64% to over 99% of peak on the 65,536-node (64 × 32 × 32) Blue Gene/L machine at the Lawrence Livermore National Lab. We show the impact of the all-to-all performance optimizations in 1-D and 3-D FFT benchmarks. With our optimized all-to-all, we achieved a performance of over 2.8 TF for the HPC Challenge 1-D FFT benchmark.
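The dependence on partition shape can be illustrated with a small back-of-the-envelope model. The sketch below is not taken from the paper: it assumes shortest-path routing on a torus, one message from every node to every node, and traffic spread evenly over the two directed links each node has per dimension. The function names `ring_hop_sum` and `alltoall_link_load` are hypothetical and chosen only for this illustration.

```python
from math import prod


def ring_hop_sum(n):
    """Sum of shortest-path hop counts from one node to every node on a ring of n nodes."""
    return sum(min(d, n - d) for d in range(n))


def alltoall_link_load(shape):
    """Average number of messages crossing each directed link, per torus dimension.

    Simplifying assumptions (not from the paper): every node sends one message
    to every node, routing is shortest-path, and load is perfectly balanced
    over the 2 * P directed links of each dimension.
    """
    p = prod(shape)
    loads = []
    for n in shape:
        # For one source, the hops travelled in this dimension, summed over all
        # destinations, are (p / n) * ring_hop_sum(n); multiply by p sources.
        total_hops = p * (p // n) * ring_hop_sum(n)
        links = 2 * p  # one "+" and one "-" directed link per node
        loads.append(total_hops / links)
    return loads


if __name__ == "__main__":
    # Two partitions with the same node count (512) but different shapes.
    for shape in [(8, 8, 8), (32, 4, 4)]:
        loads = alltoall_link_load(shape)
        print(shape, loads, "bottleneck:", max(loads))
```

Under these assumptions the most heavily loaded links lie along the longest dimension, so an elongated partition has a higher bottleneck load than a cubic partition with the same number of nodes. The model ignores packet overheads, routing details, and software costs, so it only indicates the trend rather than the throughput figures reported above.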