Sameer Kumar, Yogish Sabharwal, R. Garg, P. Heidelberger
{"title":"Optimization of All-to-All Communication on the Blue Gene/L Supercomputer","authors":"Sameer Kumar, Yogish Sabharwal, R. Garg, P. Heidelberger","doi":"10.1109/ICPP.2008.83","DOIUrl":null,"url":null,"abstract":"All-to-all communication is a well known performance bottleneck for many applications, such as the ones that use the Fast-Fourier-transform (FFT) algorithm. We analyze the performance of all-to-all communication on the BlueGene/L torus interconnect that has link contention even for all-to-all operations with short messages. We observed that the performance of all-to-all depends on the shape of the processor partition. We present a performance analysis of all-to-all on partitions of various shapes. We then present optimization schemes that substantially improve the performance of all-to-all with short and large messages.In particular, throughput improved from 64% to over 99% of peak on the 65,536 (64 times 32 times 32) node Blue Gene/L machine at the Lawrence Livermore National Lab. We show the impact of the all-to-all performance optimizations in 1-D and 3-D FFT benchmarks. We achieved a performance of over 2.8 TF for the HPC Challenge 1D FFT benchmark with our optimized all-to-all.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"70","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 37th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2008.83","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 70
Abstract
All-to-all communication is a well known performance bottleneck for many applications, such as the ones that use the Fast-Fourier-transform (FFT) algorithm. We analyze the performance of all-to-all communication on the BlueGene/L torus interconnect that has link contention even for all-to-all operations with short messages. We observed that the performance of all-to-all depends on the shape of the processor partition. We present a performance analysis of all-to-all on partitions of various shapes. We then present optimization schemes that substantially improve the performance of all-to-all with short and large messages.In particular, throughput improved from 64% to over 99% of peak on the 65,536 (64 times 32 times 32) node Blue Gene/L machine at the Lawrence Livermore National Lab. We show the impact of the all-to-all performance optimizations in 1-D and 3-D FFT benchmarks. We achieved a performance of over 2.8 TF for the HPC Challenge 1D FFT benchmark with our optimized all-to-all.