FPGA-based implementation of M4RM for matrix multiplication over GF(2)
Vivek Kumar, Vinay B. Y. Kumar, S. Patkar
18th International Symposium on VLSI Design and Test, 2014-08-21
DOI: https://doi.org/10.1109/ISVDAT.2014.6881072
The Method of Four Russians for Multiplication (M4RM) is one of the most efficient algorithms for dense matrix multiplication over the binary field, GF(2), particularly on commodity general-purpose processors. We present an efficient tile-based hardware/software implementation of M4RM in which the hardware side performs the constituent block multiplications in a streaming fashion and the software side performs the accumulations. With designs for 64 × 64 and 128 × 128 block matrix multiplications, sizes feasible for FPGA targets, we compare performance against the fastest software implementations of M4RM on commodity processors. The designs were implemented in Bluespec SystemVerilog and evaluated using the hardware/software co-emulation framework SCE-MI. Using the 128 × 128 hardware modules running at 140 MHz on a Cyclone IV FPGA, a 16,384 × 16,384 matrix multiplication with the Strassen-Winograd scheme could be done in ~3.0 s, at a sustained ~8000 bit operations per cycle; in comparison, M4RM on an Intel Core 2 Duo running at 2.33 GHz takes ~8 s, at a sustained ~500 bit operations per cycle.
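To make the two ideas in the abstract concrete, the following is a minimal software-only sketch in Python, not the authors' Bluespec design. It assumes each matrix row is stored as an integer bitmask (bit j holds column j), so that row addition over GF(2) is a single XOR. The function names `m4rm`, `naive_mul`, and `tiled_mul`, and the parameters `k` (table width) and `t` (tile size), are illustrative assumptions. The lookup table is filled in plain index order rather than Gray-code order, and the tiling uses the schoolbook block scheme rather than Strassen-Winograd.

```python
def m4rm(A, B, k=8):
    """Multiply A (m x l) by B (l x n) over GF(2) with the Method of Four Russians.

    Rows are Python ints used as bitmasks: bit j of A[r] is entry (r, j).
    """
    l = len(B)
    C = [0] * len(A)
    for start in range(0, l, k):
        kk = min(k, l - start)  # the last chunk may hold fewer than k rows
        # table[idx] = XOR of the rows B[start + b] for every set bit b of idx.
        # Each entry costs one XOR on top of an entry computed earlier.
        table = [0] * (1 << kk)
        for idx in range(1, 1 << kk):
            low = idx & -idx  # lowest set bit of idx
            table[idx] = table[idx ^ low] ^ B[start + low.bit_length() - 1]
        mask = (1 << kk) - 1
        for r, row in enumerate(A):
            # kk bits of the A row select one precomputed combination of B rows.
            C[r] ^= table[(row >> start) & mask]
    return C


def naive_mul(A, B):
    """Reference product over GF(2): XOR together the rows of B picked by each A row."""
    C = []
    for row in A:
        acc, j = 0, 0
        while row:
            if row & 1:
                acc ^= B[j]
            row >>= 1
            j += 1
        C.append(acc)
    return C


def tiled_mul(A, B, n, t):
    """Schoolbook tiled product of two n x n matrices, assuming t divides n.

    The m4rm call on t x t blocks stands in for the hardware block multiplier;
    the XOR accumulation stands in for the software side.
    """
    mask = (1 << t) - 1

    def block(M, I, J):
        # Extract the t x t block at block-row I, block-column J.
        return [(M[r] >> (J * t)) & mask for r in range(I * t, (I + 1) * t)]

    nb = n // t
    C = [0] * n
    for I in range(nb):
        for J in range(nb):
            acc = [0] * t
            for K in range(nb):
                prod = m4rm(block(A, I, K), block(B, K, J))  # block multiply
                acc = [x ^ y for x, y in zip(acc, prod)]     # accumulate
            for r in range(t):
                C[I * t + r] |= acc[r] << (J * t)
    return C
```

For square inputs whose size is a multiple of `t`, `tiled_mul(A, B, n, t)` should agree with `m4rm(A, B)` and `naive_mul(A, B)`. In the paper's setting the block size would be 64 or 128 rather than the small values convenient for testing.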