基于fpga的矩阵乘法M4RM在GF(2)上的实现

18th International Symposium on VLSI Design and Test Pub Date : 2014-08-21 DOI:10.1109/ISVDAT.2014.6881072

Vivek Kumar, Vinay B. Y. Kumar, S. Patkar

{"title":"基于fpga的矩阵乘法M4RM在GF(2)上的实现","authors":"Vivek Kumar, Vinay B. Y. Kumar, S. Patkar","doi":"10.1109/ISVDAT.2014.6881072","DOIUrl":null,"url":null,"abstract":"The Method of Four Russians for Multiplication (M4RM) is one of the most efficient algorithms for dense matrix multiplication over binary field targeting particularly the commodity general purpose processors. We present an efficient tile-based hardware/software implementation of M4RM, with the hardware side handling the constituent block multiplications in a streaming fashion, and the software side doing the accumulations. With designs for 64 × 64 and 128 × 128 sized block matrix multiplications, sizes feasible for targeting FPGAs, we compare the performance with the fastest software implementations of M4RM on commodity processors. The designs were implemented in Bluespec SystemVerilog, and evaluated over the hardware/software co-emulation framework, SCE-MI. Using the 128 × 128 hardware modules, a 16, 384 × 16, 384 matrix multiplication, running at 140 MHz could be done in ~ 3.0s using the Strassen-Winograd scheme when targeting a Cyclone IV FPGA and at a sustained bit operations per cycle of ~ 8000; where, in comparision, M4RM on Intel Core2Duo running at 2.33GHz, takes ~ 8s and at a sustained bit operations per cycle of ~ 500.","PeriodicalId":217280,"journal":{"name":"18th International Symposium on VLSI Design and Test","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"FPGA-based implementation of M4RM for matrix multiplication over GF(2)\",\"authors\":\"Vivek Kumar, Vinay B. Y. Kumar, S. Patkar\",\"doi\":\"10.1109/ISVDAT.2014.6881072\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Method of Four Russians for Multiplication (M4RM) is one of the most efficient algorithms for dense matrix multiplication over binary field targeting particularly the commodity general purpose processors. We present an efficient tile-based hardware/software implementation of M4RM, with the hardware side handling the constituent block multiplications in a streaming fashion, and the software side doing the accumulations. With designs for 64 × 64 and 128 × 128 sized block matrix multiplications, sizes feasible for targeting FPGAs, we compare the performance with the fastest software implementations of M4RM on commodity processors. The designs were implemented in Bluespec SystemVerilog, and evaluated over the hardware/software co-emulation framework, SCE-MI. Using the 128 × 128 hardware modules, a 16, 384 × 16, 384 matrix multiplication, running at 140 MHz could be done in ~ 3.0s using the Strassen-Winograd scheme when targeting a Cyclone IV FPGA and at a sustained bit operations per cycle of ~ 8000; where, in comparision, M4RM on Intel Core2Duo running at 2.33GHz, takes ~ 8s and at a sustained bit operations per cycle of ~ 500.\",\"PeriodicalId\":217280,\"journal\":{\"name\":\"18th International Symposium on VLSI Design and Test\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"18th International Symposium on VLSI Design and Test\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISVDAT.2014.6881072\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"18th International Symposium on VLSI Design and Test","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISVDAT.2014.6881072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

四俄式乘法(M4RM)是针对商品通用处理器在二进制域上进行密集矩阵乘法的最有效算法之一。我们提出了一种高效的基于tile的M4RM硬件/软件实现，硬件端以流方式处理组成块乘法，而软件端进行累积。采用64 × 64和128 × 128大小的块矩阵乘法设计，尺寸适用于fpga，我们将性能与M4RM在商用处理器上最快的软件实现进行比较。该设计在Bluespec SystemVerilog中实现，并在硬件/软件协同仿真框架SCE-MI上进行了评估。使用128 × 128硬件模块，当针对Cyclone IV FPGA时，使用Strassen-Winograd方案可以在~ 3.0秒内完成运行在140 MHz的16,384 × 16,384矩阵乘法，并且每个周期的持续位操作为~ 8000;相比之下，运行在2.33GHz的Intel Core2Duo上的M4RM需要约8秒，并且每个周期的持续位操作约为500。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

FPGA-based implementation of M4RM for matrix multiplication over GF(2)

The Method of Four Russians for Multiplication (M4RM) is one of the most efficient algorithms for dense matrix multiplication over binary field targeting particularly the commodity general purpose processors. We present an efficient tile-based hardware/software implementation of M4RM, with the hardware side handling the constituent block multiplications in a streaming fashion, and the software side doing the accumulations. With designs for 64 × 64 and 128 × 128 sized block matrix multiplications, sizes feasible for targeting FPGAs, we compare the performance with the fastest software implementations of M4RM on commodity processors. The designs were implemented in Bluespec SystemVerilog, and evaluated over the hardware/software co-emulation framework, SCE-MI. Using the 128 × 128 hardware modules, a 16, 384 × 16, 384 matrix multiplication, running at 140 MHz could be done in ~ 3.0s using the Strassen-Winograd scheme when targeting a Cyclone IV FPGA and at a sustained bit operations per cycle of ~ 8000; where, in comparision, M4RM on Intel Core2Duo running at 2.33GHz, takes ~ 8s and at a sustained bit operations per cycle of ~ 500.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

18th International Symposium on VLSI Design and Test

自引率

0.00%

发文量