Authors: Jie Wang, J. Cong
Venue: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Publication date: 2015-02-22
DOI: https://doi.org/10.1145/2684746.2689147
Customizable and High Performance Matrix Multiplication Kernel on FPGA (Abstract Only)
Matrix multiplication (MM) is an important kernel in many application domains, including scientific computing, image processing, and machine learning. Numerous accelerator designs have been proposed to achieve higher throughput and energy efficiency. In this paper we present a customizable FPGA accelerator for matrix multiplication. We also develop a design automation flow that generates the design configuration with the highest throughput for a given matrix size and target FPGA platform. It can be integrated with HLS tools as a basic parameterizable library component. Experiments show that for 512×512 single-precision MM, we achieve up to 358 GFLOPS on the Xilinx Virtex-7 XC7VX485T-2, which outperforms any published state-of-the-art FPGA accelerator design by at least 28.3%.
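To put the reported figures in perspective, the following is a minimal back-of-the-envelope sketch, not part of the original abstract. It assumes the conventional operation count of 2n³ floating-point operations for a dense n×n matrix multiply, assumes the peak throughput is sustained for the whole computation with data transfer ignored, and reads "at least 28.3%" as a relative throughput gain over the best prior design. The function names are hypothetical and chosen only for illustration.

```python
def matmul_flops(n: int) -> int:
    """Conventional operation count for a dense n x n multiply:
    n^3 multiplications plus n^3 additions (an assumption, not from the abstract)."""
    return 2 * n ** 3


def runtime_seconds(n: int, gflops: float) -> float:
    """Idealized runtime if the given throughput were sustained throughout."""
    return matmul_flops(n) / (gflops * 1e9)


def implied_baseline(gflops: float, improvement_pct: float) -> float:
    """Upper bound on the best prior throughput, assuming the improvement
    is measured relative to that prior design."""
    return gflops / (1.0 + improvement_pct / 100.0)


if __name__ == "__main__":
    n = 512
    peak = 358.0  # GFLOPS, as reported in the abstract
    print(f"operations:        {matmul_flops(n):,}")
    print(f"idealized runtime: {runtime_seconds(n, peak) * 1e3:.3f} ms")
    print(f"implied baseline:  {implied_baseline(peak, 28.3):.1f} GFLOPS")
```

Under these assumptions, one 512×512 multiply involves about 268 million operations and would take roughly 0.75 ms at the reported peak, and the best previously published design would reach at most about 279 GFLOPS.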