On-chip memory efficient data layout for 2D FFT on 3D memory integrated FPGA
Shreyas G. Singapura, R. Kannan, V. Prasanna
2016 IEEE High Performance Extreme Computing Conference (HPEC), 2016-09-01
DOI: 10.1109/HPEC.2016.7761606 (https://doi.org/10.1109/HPEC.2016.7761606)
Citations: 3
Abstract
3D memories are becoming a viable solution to the memory wall problem and to the bandwidth requirements of memory-intensive applications. However, the high bandwidth provided by 3D memories does not translate into a proportional increase in performance for all applications. For an application such as 2D FFT, which has strided access patterns, the data layout in memory has a significant impact on total execution time. In this paper, we present a data layout for 2D FFT on 3D memory integrated FPGA that is both on-chip memory efficient and throughput-optimal. Our data layout ensures that consecutive accesses to 3D memory are sufficiently interleaved among layers and vaults to absorb the latency due to activation overheads for both sequential (Row FFT) and strided (Column FFT) accesses. For an N × N FFT problem size and c columns in a 3D memory bank row, the current state-of-the-art implementation on 3D memory requires O(√c · N) on-chip memory to reduce the strided accesses and achieve maximum bandwidth. Our proposed data layout optimizes the throughput of both the Row FFT and Column FFT phases of 2D FFT with O(N) on-chip memory for the same problem size and memory parameters without decreasing the memory bandwidth, thereby achieving a √c× reduction in on-chip memory. On architectures with limited on-chip memory, our data layout achieves a 2× to 4× improvement in execution time compared with the state-of-the-art 2D FFT implementation on 3D memory.
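The sequential and strided access patterns mentioned above can be illustrated with a minimal Python sketch of the row-column decomposition of a 2D FFT. It does not reproduce the paper's data layout or any 3D memory behavior. It assumes a plain row-major linear address space, uses a naive DFT as a stand-in for a 1D FFT kernel, and all function names (`dft`, `fft2d_row_column`, `row_addresses`, `column_addresses`, `onchip_memory_ratio`) are hypothetical names chosen for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used here only as a stand-in for a 1D FFT kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft2d_row_column(a):
    """2D DFT of an N x N matrix: a Row FFT phase followed by a Column FFT phase."""
    n = len(a)
    # Row phase: each 1D transform reads one full row.
    rows = [dft(row) for row in a]
    # Column phase: each 1D transform reads one element from every row.
    cols = [dft([rows[r][c] for r in range(n)]) for c in range(n)]
    # cols[c][r] holds output element (r, c), so transpose back.
    return [[cols[c][r] for c in range(n)] for r in range(n)]


def row_addresses(n, r):
    """Linear addresses touched by the Row FFT of row r (assumed row-major layout)."""
    return [r * n + c for c in range(n)]


def column_addresses(n, c):
    """Linear addresses touched by the Column FFT of column c (assumed row-major layout)."""
    return [r * n + c for r in range(n)]


def onchip_memory_ratio(c):
    """Ratio of the O(sqrt(c) * N) bound to the O(N) bound, ignoring constant factors."""
    return c ** 0.5


if __name__ == "__main__":
    print(row_addresses(4, 1))      # [4, 5, 6, 7]: unit stride
    print(column_addresses(4, 1))   # [1, 5, 9, 13]: stride of N
    print(onchip_memory_ratio(16))  # 4.0 for a hypothetical c = 16
```

In a row-major layout the row phase walks consecutive addresses while the column phase jumps by N elements per access, which is the strided pattern the abstract says a data layout must absorb. The ratio helper only restates the √c× reduction from the two asymptotic bounds; the value c = 16 is an arbitrary example, not a parameter taken from the paper.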