{"title":"Customizable and High Performance Matrix Multiplication Kernel on FPGA (Abstract Only)","authors":"Jie Wang, J. Cong","doi":"10.1145/2684746.2689147","DOIUrl":null,"url":null,"abstract":"Matrix multiplication (MM) is an important kernel in many application domains, including scientific computing, image processing, machine learning, etc. Numerous accelerator designs have been proposed for higher throughput and energy efficiency. In this paper we present a customizable FPGA accelerator of matrix multiplication. We also develop a design automation flow to generate the optimal design configuration with the highest throughput given the matrix size and target FPGA platform. It can be integrated with HLS tools as a basic parameterizable library component. Experiments show that for 512×512 single precision MM, we can achieve as high as 358 GFLOPs on the Xilinx Virtex-7 XC7VX485T-2, which outperforms any published state-of-the-art FPGA accelerator design by at least 28.3%.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2684746.2689147","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Matrix multiplication (MM) is an important kernel in many application domains, including scientific computing, image processing, and machine learning. Numerous accelerator designs have been proposed to achieve higher throughput and energy efficiency. In this paper, we present a customizable FPGA accelerator for matrix multiplication. We also develop a design automation flow that generates the design configuration with the highest throughput for a given matrix size and target FPGA platform. The accelerator can be integrated with HLS tools as a basic parameterizable library component. Experiments show that for 512×512 single-precision MM, we achieve up to 358 GFLOPS on the Xilinx Virtex-7 XC7VX485T-2, which outperforms any published state-of-the-art FPGA accelerator design by at least 28.3%.
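
To put the reported figures in perspective, the following minimal Python sketch works through the arithmetic they imply. It rests on assumptions that are not stated in the abstract: the conventional count of 2N³ floating-point operations for an N×N matrix product, and a sustained rate equal to the reported peak. The function names are hypothetical and chosen only for illustration.

```python
def matmul_flops(n: int) -> int:
    """Conventional operation count for an n x n by n x n product.

    Assumption: n^3 multiplies plus n^3 adds; the abstract does not say
    how operations were counted.
    """
    return 2 * n ** 3


def runtime_seconds(n: int, gflops: float) -> float:
    """Time for one product if the given rate were sustained throughout."""
    return matmul_flops(n) / (gflops * 1e9)


def implied_baseline_gflops(gflops: float, improvement_pct: float) -> float:
    """Rate of a prior design that the given rate beats by improvement_pct."""
    return gflops / (1.0 + improvement_pct / 100.0)


if __name__ == "__main__":
    n = 512
    reported_gflops = 358.0   # figure reported in the abstract
    improvement_pct = 28.3    # margin reported in the abstract

    print(f"operations per product: {matmul_flops(n)}")
    print(f"time per product: {runtime_seconds(n, reported_gflops) * 1e3:.3f} ms")
    print(f"implied best prior rate: "
          f"{implied_baseline_gflops(reported_gflops, improvement_pct):.1f} GFLOPS")
```

Under these assumptions, one 512×512 product takes about 0.75 ms, and the "at least 28.3%" margin implies that the best previously published design reached at most roughly 279 GFLOPS.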