FPGA implementation and performance evaluation of a simultaneous multithreaded matrix processor

2014 9th International Conference on Computer Engineering & Systems (ICCES). Pub Date: 2014-12-01. DOI: 10.1109/ICCES.2014.7030959

M. Soliman, E. Elsayed


Citations: 0

Abstract



This paper proposes a simultaneous multithreaded matrix processor, SMMP, that improves the performance of data-parallel applications by exploiting instruction-level, data-level, and thread-level parallelism (ILP, DLP, and TLP). SMMP extends the well-known 5-stage pipeline (the baseline scalar processor) to execute multi-scalar, vector, and matrix instructions on unified parallel execution datapaths. Each cycle, SMMP can issue either four scalar instructions from two threads or four vector/matrix operations from one thread; vector/matrix instructions from the threads are executed in round-robin fashion. SMMP is implemented in VHDL targeting a Virtex-6 FPGA, and its performance is evaluated on kernels from the Basic Linear Algebra Subprograms (BLAS). The results show that the hardware complexity of SMMP is 5.68 times that of the baseline scalar processor. In return, SMMP achieves the following speedups on BLAS kernels: 4.9 (applying Givens rotation), 6.09 (scalar times vector plus another vector), 6.98 (vector addition), 8.2 (vector scaling), 8.25 (setting up Givens rotation), 8.72 (dot product), 9.36 (matrix-vector multiplication), 11.84 (Euclidean length), and 21.57 (matrix-matrix multiplication). On average, the speedup over the baseline is 9.55 and the speedup over complexity is 1.68.
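
The two summary figures can be reproduced from the per-kernel numbers. The abstract does not state how they were computed, so the sketch below assumes that the average is an arithmetic mean over the nine kernels and that "speedup over complexity" is the average speedup divided by the 5.68 complexity ratio. The variable names are illustrative only; under these assumptions the results round to the reported 9.55 and 1.68.

```python
# Per-kernel speedups as listed in the abstract.
speedups = {
    "applying Givens rotation": 4.9,
    "scalar times vector plus another": 6.09,
    "vector addition": 6.98,
    "vector scaling": 8.2,
    "setting up Givens rotation": 8.25,
    "dot product": 8.72,
    "matrix-vector multiplication": 9.36,
    "Euclidean length": 11.84,
    "matrix-matrix multiplication": 21.57,
}

# Hardware complexity of SMMP relative to the baseline scalar processor.
complexity_ratio = 5.68

# Assumption: the reported average is a plain arithmetic mean.
average_speedup = sum(speedups.values()) / len(speedups)

# Assumption: "speedup over complexity" is the mean speedup per unit
# of added hardware complexity.
speedup_per_complexity = average_speedup / complexity_ratio

print(f"average speedup: {average_speedup:.2f}")
print(f"speedup over complexity: {speedup_per_complexity:.2f}")
```

A ratio above 1 means the performance gain outpaces the growth in hardware cost, which is the point the closing sentence of the abstract makes.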

