On-chip memory efficient data layout for 2D FFT on 3D memory integrated FPGA
Shreyas G. Singapura, R. Kannan, V. Prasanna
2016 IEEE High Performance Extreme Computing Conference (HPEC), published 2016-09-01
DOI: 10.1109/HPEC.2016.7761606 (https://doi.org/10.1109/HPEC.2016.7761606)
Citations: 3
Abstract
3D memories are becoming a viable solution to the memory wall problem and a way to meet the bandwidth requirements of memory-intensive applications. However, the high bandwidth they provide does not translate into a proportional performance increase for all applications. For an application such as 2D FFT, which has strided access patterns, the data layout in memory has a significant impact on total execution time. In this paper, we present a data layout for 2D FFT on 3D memory integrated FPGA that is both on-chip memory efficient and throughput-optimal. Our data layout ensures that consecutive accesses to 3D memory are sufficiently interleaved among layers and vaults to absorb the latency due to activation overheads for both sequential (Row FFT) and strided (Column FFT) accesses. For an N × N FFT problem with c columns in a 3D memory bank row, the current state-of-the-art implementation on 3D memory requires O(√c · N) on-chip memory to reduce strided accesses and achieve maximum bandwidth. Our proposed data layout optimizes the throughput of both the Row FFT and Column FFT phases of 2D FFT with O(N) on-chip memory for the same problem size and memory parameters, without decreasing memory bandwidth, thereby achieving a √c× reduction in on-chip memory. On architectures with limited on-chip memory, our data layout achieves a 2× to 4× improvement in execution time compared with the state-of-the-art 2D FFT implementation on 3D memory.
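The abstract does not spell out the layout itself, so the following is only a toy sketch of the underlying idea, not the authors' scheme. It assumes a hypothetical memory in which a simple modulo selects the vault, and compares a plain row-major mapping with a generic diagonal (skewed) mapping. All function names and parameter values are illustrative assumptions. Layers and activation timing are not modeled, and the constants hidden by the O() bounds are taken as 1.

```python
import math

def row_major_vault(r, c, n, vaults):
    # Naive layout (assumption): element (r, c) sits at linear address r*n + c,
    # and the address modulo the vault count selects the vault.
    return (r * n + c) % vaults

def skewed_vault(r, c, n, vaults):
    # Generic diagonal skew (not the paper's layout): stepping along a row
    # or along a column both advance the vault index by one.
    return (r + c) % vaults

def min_distinct_vaults(mapping, coords, n, vaults):
    # Smallest number of distinct vaults touched by any window of `vaults`
    # consecutive accesses; 1 means every access hits the same vault.
    ids = [mapping(r, c, n, vaults) for r, c in coords]
    return min(len(set(ids[i:i + vaults]))
               for i in range(len(ids) - vaults + 1))

def on_chip_words(n, cols_per_row):
    # On-chip memory in words, with O() constants assumed to be 1:
    # sqrt(c) * N for the prior approach versus N for the proposed layout.
    return math.sqrt(cols_per_row) * n, n

N, VAULTS = 64, 16                        # hypothetical sizes
row_walk = [(3, c) for c in range(N)]     # sequential access (Row FFT)
col_walk = [(r, 5) for r in range(N)]     # strided access (Column FFT)

for name, mapping in [("row-major", row_major_vault), ("skewed", skewed_vault)]:
    print(name,
          "row:", min_distinct_vaults(mapping, row_walk, N, VAULTS),
          "column:", min_distinct_vaults(mapping, col_walk, N, VAULTS))

baseline, proposed = on_chip_words(1024, 64)
print("on-chip memory ratio:", baseline / proposed)
```

Under these assumptions the row-major mapping sends every column access to a single vault, while the skewed mapping spreads both row and column walks over all vaults. The last line illustrates the √c factor from the abstract for the hypothetical value c = 64.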