Performance Modeling of Matrix Multiplication on 3D Memory Integrated FPGA

Authors: Shreyas G. Singapura, A. Panangadan, V. Prasanna
Published in: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015-05-25
DOI: 10.1109/IPDPSW.2015.133
Citations: 5
Abstract
Recent advances in three-dimensional integrated circuits have enabled vertical stacks of memory to be integrated with an FPGA layer. Such architectures provide high-bandwidth, low-latency access to memory, which is beneficial for memory-intensive applications. We build a performance model of a representative 3D Memory Integrated FPGA architecture for matrix multiplication. We derive the peak performance of the algorithm on this model in terms of throughput and energy efficiency. We evaluate the effect of different architecture parameters on performance and identify the critical bottlenecks. The parameters include the configuration of memory layers, vaults, and Through Silicon Vias (TSVs). Our analysis indicates that memory is one of the major consumers of energy on such an architecture. We model memory activation scheduling on vaults for this application and show that it improves energy efficiency by 1.83× while maintaining a throughput of 200 GOPS. The 3D Memory Integrated FPGA model achieves a peak energy efficiency of 93 GOPS/J for a matrix of size 16K×16K. We also compare the peak performance of a 2D architecture with that of the 3D architecture and observe only a marginal improvement in both throughput and energy efficiency; our analysis indicates that the bottleneck is the FPGA, which dominates the total computation time and energy consumption. In addition to matrix multiplication, which requires O(m³) computation, we also analyze the class of applications that require only O(m²) work. In particular, for matrix transposition we find improvements on the order of 3× in energy consumption and 7× in runtime. This indicates that the computation cost of an application must match the memory access time in order to exploit the large bandwidth of 3D memory.
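The closing observation — that an application's computation cost must keep pace with memory access time to exploit 3D-memory bandwidth — is essentially a roofline-style argument: O(m³) matrix multiplication has arithmetic intensity that grows with m, while O(m²) kernels such as transposition have constant intensity and stay memory-bound. A minimal sketch of that reasoning (all platform numbers below are hypothetical illustrations, not parameters of the architecture modeled in the paper):

```python
# Roofline-style check of whether a kernel is compute- or memory-bound.
# The peak-compute and bandwidth figures are made up for illustration;
# they are NOT the 3D Memory Integrated FPGA parameters from the paper.

def intensity_matmul(m: int, bytes_per_word: int = 4) -> float:
    """Dense m x m matmul: ~2*m^3 ops over ~3*m^2 words moved
    (assuming ideal on-chip reuse), so intensity grows linearly in m."""
    ops = 2 * m**3
    bytes_moved = 3 * m**2 * bytes_per_word
    return ops / bytes_moved

def intensity_transpose(m: int, bytes_per_word: int = 4) -> float:
    """Transpose: m^2 element moves over 2*m^2 words of traffic,
    so intensity is a small constant, independent of m."""
    ops = m**2
    bytes_moved = 2 * m**2 * bytes_per_word
    return ops / bytes_moved

def bound_by(intensity: float, peak_ops: float, bandwidth: float) -> str:
    """Memory-bound when achievable ops (bandwidth * intensity)
    fall short of peak compute throughput."""
    return "compute" if bandwidth * intensity >= peak_ops else "memory"

# Hypothetical platform: 200 GOPS peak compute, 100 GB/s memory bandwidth.
peak, bw = 200e9, 100e9
print(bound_by(intensity_matmul(16384), peak, bw))     # "compute"
print(bound_by(intensity_transpose(16384), peak, bw))  # "memory"
```

This mirrors the abstract's conclusion: for the O(m³) kernel the FPGA's compute throughput becomes the bottleneck, so extra 3D-memory bandwidth yields only marginal gains, whereas O(m²) kernels like transposition directly benefit from the higher bandwidth.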