Jiang Jiang, Vincent Mirian, Kam Pui Tang, P. Chow, Zuocheng Xing
2009 International Conference on Reconfigurable Computing and FPGAs, published 2009-12-09. DOI: 10.1109/ReConFig.2009.30 (https://doi.org/10.1109/ReConFig.2009.30)
Matrix Multiplication Based on Scalable Macro-Pipelined FPGA Accelerator Architecture
In this paper, we introduce a scalable macro-pipelined architecture (SMPA) for floating-point matrix multiplication that aims to exploit temporal parallelism and architectural scalability. We demonstrate the functionality of the hardware design with 16 processing elements (PEs) on a Xilinx ML507 development board containing a Virtex-5 XC5VFX70T FPGA. A 32-PE design is also simulated for matrix sizes ranging from 32×32 to 1024×1024. Our experiments show that the design achieves 12.18 GFLOPS with 32 PEs, or about 1.90 GFLOPS per PE per GHz, which corresponds to over 95% PE utilization. Moreover, the proposed SMPA can scale up to tens or hundreds of GFLOPS using multiple FPGA devices and a high-speed interconnect.
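The relationship between the reported figures can be checked with a short calculation. The sketch below is illustrative only and rests on an assumption the abstract does not state: that each PE can issue at most one multiply and one add per clock cycle, giving a peak of 2 FLOPs per cycle (2 GFLOPS per PE per GHz). The function names and the `flops_per_cycle` parameter are hypothetical and are not taken from the paper.

```python
def implied_clock_ghz(total_gflops: float, num_pes: int,
                      gflops_per_pe_per_ghz: float) -> float:
    """Clock frequency implied by the total and the normalized throughput."""
    return total_gflops / (num_pes * gflops_per_pe_per_ghz)


def pe_utilization(gflops_per_pe_per_ghz: float,
                   flops_per_cycle: float = 2.0) -> float:
    """Fraction of peak throughput achieved per PE.

    flops_per_cycle = 2.0 is an ASSUMPTION (one multiply plus one add
    per cycle); the abstract does not state the PE's peak rate.
    """
    return gflops_per_pe_per_ghz / flops_per_cycle


if __name__ == "__main__":
    # Figures quoted in the abstract.
    total_gflops = 12.18
    num_pes = 32
    normalized = 1.90  # GFLOPS per PE per GHz ("about 1.90")

    clock = implied_clock_ghz(total_gflops, num_pes, normalized)
    usage = pe_utilization(normalized)
    print(f"implied clock: {clock:.3f} GHz")
    print(f"PE utilization (assuming 2 FLOPs/cycle peak): {usage:.1%}")
```

Under that assumption, the quoted numbers imply a clock of roughly 200 MHz and a utilization of about 95%, which agrees with the usage figure in the abstract given that 1.90 is itself a rounded value.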