A highly efficient implementation of back propagation algorithm using matrix instruction set architecture

Neural Parallel Sci. Comput. Pub Date : 2007-03-01 DOI:10.5555/1315424.1315425

M. Soliman, S. Mohamed

Citations: 0

Abstract

The Back Propagation (BP) training algorithm has received intensive research effort aimed at exploiting its parallelism to reduce the training time for complex problems. A modified version of BP based on matrix-matrix multiplication was proposed for parallel processing. This paper discusses the implementation of Matrix Back Propagation (MBP) using scalar, vector, and matrix instruction set architectures (ISAs). It also shows that the performance of MBP improves when switching from scalar to vector ISA and from vector to matrix ISA. On a practical application, speech recognition, the speedup of training a neural network with unrolled scalar code over scalar ISA is 1.83. On eight parallel lanes, the speedups of using the vector, unrolled vector, and matrix ISAs are 10.33, 11.88, and 15.36, respectively, where the maximum theoretical speedup is 16. Our results show that the matrix ISA gives performance close to the optimum because it reuses loaded data, decreases loop overhead, and overlaps memory operations with arithmetic operations.
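The abstract does not give the MBP equations, so the following is only a minimal sketch of the general idea it names: expressing one batch training step as matrix-matrix products. It assumes a two-layer tanh network without separate bias terms and a sum-of-squares error; the function names (`matmul`, `train_step`, `run_demo`), the toy data, and the learning rate are hypothetical and are not taken from the paper. Pure Python is used for clarity, not speed; the triple loop inside `matmul` is the kernel that a vector or matrix ISA would accelerate.

```python
import math


def matmul(a, b):
    """Triple-loop product of two row-major matrices (lists of lists)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    """Return the transpose of a row-major matrix."""
    return [list(col) for col in zip(*m)]


def train_step(x, t, w1, w2, lr=0.05):
    """One batch update. x: P x N inputs, t: P x M targets,
    w1: N x H and w2: H x M weights. Returns (new_w1, new_w2, error)."""
    # Forward pass: two matrix-matrix products followed by tanh.
    h = [[math.tanh(v) for v in row] for row in matmul(x, w1)]  # P x H
    y = [[math.tanh(v) for v in row] for row in matmul(h, w2)]  # P x M

    # Sum-of-squares error before the update.
    error = 0.5 * sum((tv - yv) ** 2
                      for t_row, y_row in zip(t, y)
                      for tv, yv in zip(t_row, y_row))

    # Output deltas: (t - y) * tanh'(net), with tanh' = 1 - y^2.
    d2 = [[(tv - yv) * (1.0 - yv * yv) for tv, yv in zip(t_row, y_row)]
          for t_row, y_row in zip(t, y)]

    # Hidden deltas: back-propagate through w2, then scale by 1 - h^2.
    back = matmul(d2, transpose(w2))  # P x H
    d1 = [[bv * (1.0 - hv * hv) for bv, hv in zip(b_row, h_row)]
          for b_row, h_row in zip(back, h)]

    # Weight gradients are again matrix-matrix products.
    g2 = matmul(transpose(h), d2)  # H x M
    g1 = matmul(transpose(x), d1)  # N x H

    new_w1 = [[w + lr * g for w, g in zip(w_row, g_row)]
              for w_row, g_row in zip(w1, g1)]
    new_w2 = [[w + lr * g for w, g in zip(w_row, g_row)]
              for w_row, g_row in zip(w2, g2)]
    return new_w1, new_w2, error


def run_demo(steps=500):
    """Train on a toy OR problem (last input column acts as a bias)."""
    x = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
    t = [[0.0], [1.0], [1.0], [1.0]]
    w1 = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]
    w2 = [[0.2], [-0.3]]
    errors = []
    for _ in range(steps):
        w1, w2, err = train_step(x, t, w1, w2)
        errors.append(err)
    return errors


if __name__ == "__main__":
    history = run_demo()
    print("first error:", history[0], "last error:", history[-1])
```

In this formulation every pattern in the batch shares the same weight matrices, so each loaded weight is reused across all rows of the product. That reuse is the property the abstract credits for the matrix ISA's near-optimal speedup.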


