Performance Modeling of Matrix Multiplication on 3D Memory Integrated FPGA

Authors: Shreyas G. Singapura, A. Panangadan, V. Prasanna
Published in: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015-05-25
DOI: 10.1109/IPDPSW.2015.133
Citations: 5
Abstract
Recent advances in three-dimensional integrated circuits have enabled vertical stacks of memory to be integrated with an FPGA layer. Such architectures provide high-bandwidth, low-latency access to memory, which is beneficial for memory-intensive applications. We build a performance model of a representative 3D Memory Integrated FPGA architecture for matrix multiplication. We derive the peak performance of the algorithm on this model in terms of throughput and energy efficiency. We evaluate the effect of different architecture parameters on performance and identify the critical bottlenecks. The parameters include the configuration of memory layers, vaults, and Through Silicon Vias (TSVs). Our analysis indicates that memory is one of the major consumers of energy on such an architecture. We model memory activation scheduling on vaults for this application and show that it improves energy efficiency by 1.83× while maintaining a throughput of 200 GOPS. The 3D Memory Integrated FPGA model achieves a peak energy efficiency of 93 GOPS/J for a matrix of size 16K×16K. We also compare the peak performance of a 2D architecture with that of the 3D architecture and observe only a marginal improvement in both throughput and energy efficiency; our analysis indicates that the bottleneck is the FPGA, which dominates the total computation time and energy consumption. In addition to matrix multiplication, which requires O(m³) computation, we also analyze the class of applications that require only O(m²) work. In particular, for matrix transposition we find improvements on the order of 3× in energy consumption and 7× in runtime. This indicates that the computation cost of an application must match the memory access time in order to exploit the large bandwidth of 3D memory.
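The closing observation — that an application's computation cost must keep pace with memory access time to exploit 3D-memory bandwidth — is essentially a roofline-style argument: O(m³) matrix multiplication has arithmetic intensity that grows with m, while O(m²) kernels such as transposition have constant intensity and stay memory-bound. A minimal sketch of that reasoning (all platform numbers below are hypothetical illustrations, not parameters of the architecture modeled in the paper):

```python
# Roofline-style check of whether a kernel is compute- or memory-bound.
# The peak-compute and bandwidth figures are made up for illustration;
# they are NOT the 3D Memory Integrated FPGA parameters from the paper.

def intensity_matmul(m: int, bytes_per_word: int = 4) -> float:
    """Dense m x m matmul: ~2*m^3 ops over ~3*m^2 words moved
    (assuming ideal on-chip reuse), so intensity grows linearly in m."""
    ops = 2 * m**3
    bytes_moved = 3 * m**2 * bytes_per_word
    return ops / bytes_moved

def intensity_transpose(m: int, bytes_per_word: int = 4) -> float:
    """Transpose: m^2 element moves over 2*m^2 words of traffic,
    so intensity is a small constant, independent of m."""
    ops = m**2
    bytes_moved = 2 * m**2 * bytes_per_word
    return ops / bytes_moved

def bound_by(intensity: float, peak_ops: float, bandwidth: float) -> str:
    """Memory-bound when achievable ops (bandwidth * intensity)
    fall short of peak compute throughput."""
    return "compute" if bandwidth * intensity >= peak_ops else "memory"

# Hypothetical platform: 200 GOPS peak compute, 100 GB/s memory bandwidth.
peak, bw = 200e9, 100e9
print(bound_by(intensity_matmul(16384), peak, bw))     # "compute"
print(bound_by(intensity_transpose(16384), peak, bw))  # "memory"
```

This mirrors the abstract's conclusion: for the O(m³) kernel the FPGA's compute throughput becomes the bottleneck, so extra 3D-memory bandwidth yields only marginal gains, whereas O(m²) kernels like transposition directly benefit from the higher bandwidth.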