S. Gupta, Chua-Huang Huang, Rodney W. Johnson, P. Sadayappan
{"title":"Communication-efficient implementation of block recursive algorithms on distributed-memory machines","authors":"S. Gupta, Chua-Huang Huang, Rodney W. Johnson, P. Sadayappan","doi":"10.1109/ICPADS.1994.590060","DOIUrl":null,"url":null,"abstract":"This paper presents a design methodology for developing efficient distributed-memory parallel programs for block-recursive algorithms such as the fast Fourier transform and bitonic sort. This design methodology is specifically suited for most modern supercomputers having a distributed-memory architecture with circuit-switched or wormhole routed mesh or hypercube interconnection network. A mathematical framework based on the tenser product and other matrix operations is used for representing algorithms. Communication-efficient implementations with effectively overlapped computation and communication are achieved by manipulating the mathematical representation using the tenser algebra. Performance results for FFT programs on the Intel iPSC/860 and Intel Paragon are presented.","PeriodicalId":154429,"journal":{"name":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1994-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of 1994 International Conference on Parallel and Distributed Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPADS.1994.590060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
This paper presents a design methodology for developing efficient distributed-memory parallel programs for block-recursive algorithms such as the fast Fourier transform and bitonic sort. This design methodology is specifically suited for most modern supercomputers having a distributed-memory architecture with circuit-switched or wormhole routed mesh or hypercube interconnection network. A mathematical framework based on the tenser product and other matrix operations is used for representing algorithms. Communication-efficient implementations with effectively overlapped computation and communication are achieved by manipulating the mathematical representation using the tenser algebra. Performance results for FFT programs on the Intel iPSC/860 and Intel Paragon are presented.