AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators

Linghao Song, Fan Chen, Youwei Zhuo, Xuehai Qian, H. Li, Yiran Chen
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020
DOI: 10.1109/HPCA47549.2020.00036
Citations: 33

Abstract

Deep neural network (DNN) accelerators, as an example of domain-specific architecture, have demonstrated great success in DNN inference. However, architectural acceleration for the equally important DNN training has not yet been fully studied. With forward data propagation, backward error propagation, and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because recent research demonstrates diminishing returns from specialization, namely the "accelerator wall", we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. We present AccPar, a principled and systematic method for determining the tensor partitioning among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, AccPar considers the complete tensor partitioning space and can reveal previously unknown parallelism configurations. AccPar optimizes performance based on a cost model that accounts for both the computation and communication costs of a heterogeneous execution environment; hence, our method avoids the drawbacks of existing approaches that use communication as a proxy for performance. The enhanced flexibility of tensor partitioning in AccPar allows computations to be distributed among accelerators of different performance in flexible ratios. The proposed search algorithm is also applicable to the multi-path patterns emerging in modern DNNs such as ResNet. We simulate AccPar on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as AlexNet, the VGG series, and the ResNet series. Normalized to the baseline data-parallelism scheme, in which each accelerator replicates the model and processes different input data in parallel, the average performance improvements of the state-of-the-art "one weird trick" (OWT), HYPAR, and AccPar are 2.98×, 3.78×, and 6.30×, respectively.
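
To make the cost-model idea concrete, the minimal sketch below chooses how to split one layer's batch between two accelerators of different speeds by minimizing the slower device's estimated time, where each device's time is a simple sum of a compute term and a communication term. The linear model, the function names, and all numeric parameters are illustrative assumptions for this sketch only; they are not AccPar's actual cost model or partitioning algorithm.

def per_device_time(flops, throughput, comm_bytes, bandwidth):
    """Estimated per-device time (s) = compute time + communication time."""
    return flops / throughput + comm_bytes / bandwidth


def best_batch_split(total_batch, flops_per_sample, comm_bytes_per_sample,
                     fast_tput, slow_tput, bandwidth):
    """Try every integer batch split and keep the one that minimizes the
    slower device's time (the layer finishes only when both devices do)."""
    best = None
    for b_fast in range(total_batch + 1):
        b_slow = total_batch - b_fast
        t_fast = per_device_time(b_fast * flops_per_sample, fast_tput,
                                 b_fast * comm_bytes_per_sample, bandwidth)
        t_slow = per_device_time(b_slow * flops_per_sample, slow_tput,
                                 b_slow * comm_bytes_per_sample, bandwidth)
        step_time = max(t_fast, t_slow)
        if best is None or step_time < best[0]:
            best = (step_time, b_fast, b_slow)
    return best


if __name__ == "__main__":
    # All numbers below are assumed for illustration only.
    step_time, b_fast, b_slow = best_batch_split(
        total_batch=256,
        flops_per_sample=2e9,       # assumed FLOPs per sample for one layer
        comm_bytes_per_sample=4e5,  # assumed bytes exchanged per sample
        fast_tput=105e12,           # assumed peak FLOP/s of the faster chip
        slow_tput=45e12,            # assumed peak FLOP/s of the slower chip
        bandwidth=100e9,            # assumed interconnect bytes/s
    )
    print(f"{b_fast} samples on the fast chip, {b_slow} on the slow chip, "
          f"estimated layer time {step_time * 1e3:.3f} ms")

With these assumed numbers the chosen split roughly tracks the throughput ratio of the two devices, which mirrors the abstract's point that heterogeneous accelerators should receive shares of the computation in proportion to their performance.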