{"title":"超立方体上二维FFT方法的比较","authors":"C. Chu","doi":"10.1145/63047.63099","DOIUrl":null,"url":null,"abstract":"Complex two-dimensional FFTs up to size 256 x 256 points are implemented on the Intel iPSC/System 286 hypercube with emphasis on comparing the effects of data mapping, data transposition or communication needs, and the use of distributed FFTs. Two new implementations of the 2D-FFT include the Local-Distributed method which performs local FFTs in one direction followed by distributed FFTs in the other direction, and a Vector-Radix implementation that is derived from decimating the DFT in two-dimensions instead of one. In addition, the Transpose-Split method involving local FFTs in both directions with an intervening matrix transposition and the Block 2D-FFT involving distributed FFT butterflies in both directions are implemented and compared with the other two methods. Timing results show that on the Intel iPSC/System 286, there is hardly any difference between the methods, with the only differences arising from the efficiency or inefficiency of communication. Since the Intel cannot overlap communication and computation, this forces the user to buffer data. In some of the methods, this causes processor blocking during communication. Issues of vectorization, communication strategies, data storage and buffering requirements are investigated. A model is given that compares vectorization and communication complexity. While timing results show that the Transpose-Split method is in general slightly faster, our model shows that the Block method and Vector-Radix method have the potential to be faster if the communication difficulties were taken care of. Therefore if communication could be “hidden” within computation, the latter two methods can become useful with the Block method vectorizing the best and the Vector-Radix method having 25% fewer multiplications than row-column 2D-FFT methods. Finally the Local-Distributed method is a good hybrid method requiring no transposing and can be useful in certain circumstances. This paper provides some general guidelines in evaluating parallel distributed 2D-FFT implementations and concludes that while different methods may be best suited for different systems, better implementation techniques as well as faster algorithms still perform better when communication become more efficient.","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"57 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"Comparison of two-dimensional FFT methods on the hypercube\",\"authors\":\"C. Chu\",\"doi\":\"10.1145/63047.63099\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Complex two-dimensional FFTs up to size 256 x 256 points are implemented on the Intel iPSC/System 286 hypercube with emphasis on comparing the effects of data mapping, data transposition or communication needs, and the use of distributed FFTs. Two new implementations of the 2D-FFT include the Local-Distributed method which performs local FFTs in one direction followed by distributed FFTs in the other direction, and a Vector-Radix implementation that is derived from decimating the DFT in two-dimensions instead of one. 
In addition, the Transpose-Split method involving local FFTs in both directions with an intervening matrix transposition and the Block 2D-FFT involving distributed FFT butterflies in both directions are implemented and compared with the other two methods. Timing results show that on the Intel iPSC/System 286, there is hardly any difference between the methods, with the only differences arising from the efficiency or inefficiency of communication. Since the Intel cannot overlap communication and computation, this forces the user to buffer data. In some of the methods, this causes processor blocking during communication. Issues of vectorization, communication strategies, data storage and buffering requirements are investigated. A model is given that compares vectorization and communication complexity. While timing results show that the Transpose-Split method is in general slightly faster, our model shows that the Block method and Vector-Radix method have the potential to be faster if the communication difficulties were taken care of. Therefore if communication could be “hidden” within computation, the latter two methods can become useful with the Block method vectorizing the best and the Vector-Radix method having 25% fewer multiplications than row-column 2D-FFT methods. Finally the Local-Distributed method is a good hybrid method requiring no transposing and can be useful in certain circumstances. This paper provides some general guidelines in evaluating parallel distributed 2D-FFT implementations and concludes that while different methods may be best suited for different systems, better implementation techniques as well as faster algorithms still perform better when communication become more efficient.\",\"PeriodicalId\":299435,\"journal\":{\"name\":\"Conference on Hypercube Concurrent Computers and Applications\",\"volume\":\"57 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1989-01-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Conference on Hypercube Concurrent Computers and Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/63047.63099\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Conference on Hypercube Concurrent Computers and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/63047.63099","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 20
Abstract
Complex two-dimensional FFTs of sizes up to 256 x 256 points are implemented on the Intel iPSC/System 286 hypercube, with emphasis on comparing the effects of data mapping, of data transposition or other communication needs, and of the use of distributed FFTs. Two new implementations of the 2D-FFT are introduced: the Local-Distributed method, which performs local FFTs in one direction followed by distributed FFTs in the other, and a Vector-Radix implementation derived by decimating the DFT in two dimensions instead of one. In addition, the Transpose-Split method, which uses local FFTs in both directions with an intervening matrix transposition, and the Block 2D-FFT, which uses distributed FFT butterflies in both directions, are implemented and compared with the other two methods. Timing results show that on the Intel iPSC/System 286 there is hardly any difference between the methods; the only differences arise from the efficiency or inefficiency of communication. Because the Intel machine cannot overlap communication and computation, the user is forced to buffer data, which in some of the methods causes processor blocking during communication. Issues of vectorization, communication strategies, data storage, and buffering requirements are investigated, and a model is given that compares vectorization and communication complexity. While the timing results show that the Transpose-Split method is in general slightly faster, the model shows that the Block and Vector-Radix methods have the potential to be faster if their communication difficulties are resolved. If communication could be “hidden” within computation, these two methods would become attractive, with the Block method vectorizing best and the Vector-Radix method requiring 25% fewer multiplications than row-column 2D-FFT methods. Finally, the Local-Distributed method is a good hybrid that requires no transposition and can be useful in certain circumstances. This paper provides general guidelines for evaluating parallel distributed 2D-FFT implementations and concludes that, while different methods may be best suited to different systems, better implementation techniques as well as faster algorithms still perform better as communication becomes more efficient.
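As a point of reference for the row-column structure the Transpose-Split method builds on, the following serial NumPy sketch is illustrative only and is not taken from the paper: the function name and the use of NumPy as a stand-in for the node-local FFT kernels are assumptions. It performs 1-D FFTs along one axis, transposes, and repeats the 1-D FFTs; on the hypercube the rows would be distributed across nodes and the transposition would become interprocessor communication.

    # Serial sketch of the row-column decomposition underlying the Transpose-Split method.
    # On the hypercube, rows are spread across nodes and the transpose is communication;
    # here NumPy's FFT stands in for the local FFT kernels (illustrative assumption).
    import numpy as np

    def transpose_split_fft2(x):
        """2-D FFT via row FFTs, a matrix transposition, and a second pass of row FFTs."""
        y = np.fft.fft(x, axis=1)   # local FFTs along the rows
        y = y.T                     # transposition step (interprocessor exchange on the hypercube)
        y = np.fft.fft(y, axis=1)   # local FFTs along the former columns
        return y.T                  # transpose back to the original orientation

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.standard_normal((256, 256)) + 1j * rng.standard_normal((256, 256))
        assert np.allclose(transpose_split_fft2(x), np.fft.fft2(x))

The check against np.fft.fft2 confirms that the two 1-D passes separated by a transposition reproduce the full 2-D transform, which is the property the distributed version exploits.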
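The 25% figure quoted for the Vector-Radix method is consistent with the standard operation count for a radix-(2 x 2) decimation of an N x N transform compared with the row-column approach; the tally below is a textbook sketch of that count, not reproduced from the paper.

    % Row-column: N row FFTs plus N column FFTs, each a radix-2 FFT of length N.
    M_{\mathrm{rc}} = 2 \cdot N \cdot \tfrac{N}{2}\log_2 N = N^2 \log_2 N
    % Vector-radix (2 x 2): \log_2 N stages of (N/2)(N/2) butterflies, each needing
    % three twiddle multiplications (W^{k_1}, W^{k_2}, W^{k_1+k_2}).
    M_{\mathrm{vr}} = 3 \cdot \tfrac{N^2}{4}\log_2 N = \tfrac{3}{4} N^2 \log_2 N

The ratio M_vr / M_rc = 3/4 corresponds to the 25% reduction in complex multiplications cited in the abstract.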