CATERPILLAR: Coarse Grain Reconfigurable Architecture for accelerating the training of Deep Neural Networks

Yuanfang Li, A. Pedram
{"title":"卡特彼勒:用于加速深度神经网络训练的粗粒度可重构架构","authors":"Yuanfang Li, A. Pedram","doi":"10.1109/ASAP.2017.7995252","DOIUrl":null,"url":null,"abstract":"Accelerating the inference of a trained DNN is a well studied subject. In this paper we switch the focus to the training of DNNs. The training phase is compute intensive, demands complicated data communication, and contains multiple levels of data dependencies and parallelism. This paper presents an algorithm/architecture space exploration of efficient accelerators to achieve better network convergence rates and higher energy efficiency for training DNNs. We further demonstrate that an architecture with hierarchical support for collective communication semantics provides flexibility in training various networks performing both stochastic and batched gradient descent based techniques. Our results suggest that smaller networks favor non-batched techniques while performance for larger networks is higher using batched operations. At 45nm technology, CATERPILLAR achieves performance efficiencies of 177 GFLOPS/W at over 80% utilization for SGD training on small networks and 211 GFLOPS/W at over 90% utilization for pipelined SGD/CP training on larger networks using a total area of 103.2 mm2 and 178.9 mm2 respectively.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"CATERPILLAR: Coarse Grain Reconfigurable Architecture for accelerating the training of Deep Neural Networks\",\"authors\":\"Yuanfang Li, A. Pedram\",\"doi\":\"10.1109/ASAP.2017.7995252\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accelerating the inference of a trained DNN is a well studied subject. In this paper we switch the focus to the training of DNNs. The training phase is compute intensive, demands complicated data communication, and contains multiple levels of data dependencies and parallelism. This paper presents an algorithm/architecture space exploration of efficient accelerators to achieve better network convergence rates and higher energy efficiency for training DNNs. We further demonstrate that an architecture with hierarchical support for collective communication semantics provides flexibility in training various networks performing both stochastic and batched gradient descent based techniques. Our results suggest that smaller networks favor non-batched techniques while performance for larger networks is higher using batched operations. 
At 45nm technology, CATERPILLAR achieves performance efficiencies of 177 GFLOPS/W at over 80% utilization for SGD training on small networks and 211 GFLOPS/W at over 90% utilization for pipelined SGD/CP training on larger networks using a total area of 103.2 mm2 and 178.9 mm2 respectively.\",\"PeriodicalId\":405953,\"journal\":{\"name\":\"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASAP.2017.7995252\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASAP.2017.7995252","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 19

Abstract

Accelerating the inference of a trained DNN is a well studied subject. In this paper we switch the focus to the training of DNNs. The training phase is compute intensive, demands complicated data communication, and contains multiple levels of data dependencies and parallelism. This paper presents an algorithm/architecture space exploration of efficient accelerators to achieve better network convergence rates and higher energy efficiency for training DNNs. We further demonstrate that an architecture with hierarchical support for collective communication semantics provides flexibility in training various networks performing both stochastic and batched gradient descent based techniques. Our results suggest that smaller networks favor non-batched techniques while performance for larger networks is higher using batched operations. At 45nm technology, CATERPILLAR achieves performance efficiencies of 177 GFLOPS/W at over 80% utilization for SGD training on small networks and 211 GFLOPS/W at over 90% utilization for pipelined SGD/CP training on larger networks using a total area of 103.2 mm2 and 178.9 mm2 respectively.
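The abstract contrasts stochastic (per-example) and batched gradient descent and reports that small networks favor the former while large networks run more efficiently with the latter. The sketch below is a minimal illustration of that distinction, not code from the paper: a toy single-layer softmax classifier updated either one example at a time or in mini-batches. All names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not from the paper): per-example SGD vs. mini-batch
# gradient descent on a toy single-layer softmax classifier.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_step(W, X, Y, lr):
    """One gradient-descent update on the batch (X, Y); batch size 1 is SGD."""
    P = softmax(X @ W)                 # predictions, shape (batch, classes)
    g = X.T @ (P - Y) / len(X)         # gradient averaged over the batch
    return W - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))         # toy inputs
labels = rng.integers(0, 10, size=256)
Y = np.eye(10)[labels]                 # one-hot targets

# Per-example SGD: many small, sequentially dependent updates
# (little matrix-level parallelism per step).
W_sgd = np.zeros((64, 10))
for i in range(len(X)):
    W_sgd = grad_step(W_sgd, X[i:i+1], Y[i:i+1], lr=0.1)

# Mini-batch gradient descent: fewer, larger matrix operations per update,
# which map better onto wide parallel hardware.
W_batch = np.zeros((64, 10))
for start in range(0, len(X), 32):
    W_batch = grad_step(W_batch, X[start:start+32], Y[start:start+32], lr=0.1)
```

The trade-off the paper explores at the architecture level mirrors this: per-example updates converge in fewer passes for small models, while batched updates expose the large matrix-matrix operations that keep a wide accelerator highly utilized.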