Wavefront parallelization of recurrent neural networks on multi-core architectures

Robin Kumar Sharma, Marc Casas
{"title":"Wavefront parallelization of recurrent neural networks on multi-core architectures","authors":"Robin Kumar Sharma, Marc Casas","doi":"10.1145/3392717.3392762","DOIUrl":null,"url":null,"abstract":"Recurrent neural networks (RNNs) are widely used for natural language processing, time-series prediction, or text analysis tasks. The internal structure of RNNs inference and training in terms of data or control dependencies across their fundamental numerical kernels complicate the exploitation of model parallelism, which is the reason why just data-parallelism has been traditionally applied to accelerate RNNs. This paper presents W-Par (Wavefront-Parallelization), a comprehensive approach for RNNs inference and training on CPUs that relies on applying model parallelism into RNNs models. We use fine-grained pipeline parallelism in terms of wavefront computations to accelerate multi-layer RNNs running on multi-core CPUs. Wavefront computations have been widely applied in many scientific computing domains like stencil kernels or dynamic programming. W-Par divides RNNs workloads across different parallel tasks by defining input and output dependencies for each RNN cell. Our experiments considering different RNNs models demonstrate that W-Par achieves up to 6.6X speed-up for RNN models inference and training in comparison to current state-of-the-art implementations on modern multi-core CPU architectures. Importantly, W-Par maximizes performance on a wide range of scenarios, including different core counts or memory hierarchy configurations, without requiring any change at the source code level.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"170 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th ACM International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3392717.3392762","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Recurrent neural networks (RNNs) are widely used for natural language processing, time-series prediction, and text analysis tasks. The internal structure of RNN inference and training, in terms of data and control dependencies across their fundamental numerical kernels, complicates the exploitation of model parallelism, which is why only data parallelism has traditionally been applied to accelerate RNNs. This paper presents W-Par (Wavefront-Parallelization), a comprehensive approach for RNN inference and training on CPUs that relies on applying model parallelism to RNN models. We use fine-grained pipeline parallelism in the form of wavefront computations to accelerate multi-layer RNNs running on multi-core CPUs. Wavefront computations have been widely applied in many scientific computing domains, such as stencil kernels and dynamic programming. W-Par divides RNN workloads across different parallel tasks by defining input and output dependencies for each RNN cell. Our experiments on several RNN models demonstrate that W-Par achieves up to a 6.6x speed-up for RNN model inference and training compared to current state-of-the-art implementations on modern multi-core CPU architectures. Importantly, W-Par maximizes performance across a wide range of scenarios, including different core counts and memory hierarchy configurations, without requiring any change at the source code level.
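
The wavefront idea behind W-Par can be illustrated with a short sketch. In a multi-layer RNN, cell (l, t) depends only on cell (l-1, t) (the layer below at the same timestep) and cell (l, t-1) (the same layer at the previous timestep), so all cells on the same anti-diagonal l + t can run concurrently. The sketch below is an illustrative assumption, not the authors' W-Par implementation: the tanh cell, the sizes L, T, H, and the use of a Python thread pool are all placeholders standing in for the paper's task-based CPU runtime.

    # Minimal wavefront-parallel sketch of a multi-layer RNN forward pass.
    # Cell (l, t) reads h[l-1, t] and h[l, t-1], so every anti-diagonal
    # l + t forms an independent group of cells ("wavefront").
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    L, T, H = 4, 8, 16                    # layers, timesteps, hidden size (illustrative)
    rng = np.random.default_rng(0)
    W_in = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]
    W_rec = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]
    x = rng.standard_normal((T, H))       # input sequence

    h = np.zeros((L, T, H))               # hidden state of every cell

    def cell(l, t):
        """Simple tanh RNN cell; its inputs were produced on earlier wavefronts."""
        below = x[t] if l == 0 else h[l - 1, t]
        prev = h[l, t - 1] if t > 0 else np.zeros(H)
        h[l, t] = np.tanh(below @ W_in[l] + prev @ W_rec[l])

    with ThreadPoolExecutor() as pool:
        # Wavefront d contains all valid cells with l + t == d. The list()
        # call acts as a barrier: one wavefront finishes before the next starts,
        # so every dependency of a cell is already computed when it runs.
        for d in range(L + T - 1):
            wave = [(l, d - l) for l in range(L) if 0 <= d - l < T]
            list(pool.map(lambda lt: cell(*lt), wave))

    print(h[-1, -1][:4])                  # output of the last layer at the last timestep

Each wavefront writes disjoint slices of h, so no locking is needed; the per-wavefront barrier is what replaces the strictly sequential timestep loop of a naive implementation.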