Core Placement Optimization for Multi-chip Many-core Neural Network Systems with Reinforcement Learning

ACM Transactions on Design Automation of Electronic Systems (TODAES), Pub Date: 2020-10-19, DOI: 10.1145/3418498

Nan Wu, Lei Deng, Guoqi Li, Yuan Xie


Citations: 9

Abstract

Multi-chip many-core neural network systems provide high parallelism through decentralized execution, and they can be scaled to very large systems at reasonable fabrication cost. As these systems scale up, communication latency accounts for a growing share of system performance. Previous work focuses mainly on core placement within a single chip, which leaves two principal issues unresolved: the communication problems caused by the non-uniform, hierarchical on-chip/off-chip communication capability of multi-chip systems, and the scalability of heuristic-based approaches in a factorially growing search space. To this end, we propose a reinforcement-learning-based method that automatically optimizes core placement using deep deterministic policy gradient (DDPG). The method gathers information about the environment by performing a series of trials (i.e., placements) and uses convolutional neural networks to extract spatial features of different placements. Experimental results indicate that, compared with a naive sequential placement, the proposed method achieves a 1.99× increase in throughput and a 50.5% reduction in latency; compared with simulated annealing, an effective technique for approximating the global optimum in an extremely large search space, it improves throughput by 1.22× and reduces latency by 18.6%. We further demonstrate that the proposed method can find optimal placements that take advantage of the different communication properties arising from different system configurations, and that it works in a topology-agnostic manner.
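To make the placement problem concrete, the sketch below builds a toy cost model and runs the two baselines the abstract names: naive sequential placement and simulated annealing. It does not implement the paper's DDPG method. The 2×2 grid of chips, the 2×2 mesh of cores per chip, the XY-style hop counting, the hop costs (`ON_CHIP_HOP`, `OFF_CHIP_HOP`), the chain-shaped traffic pattern, and all function names are assumptions chosen for illustration only.

```python
import math
import random

# Hypothetical toy system: 2x2 chips, each holding a 2x2 mesh of cores.
CHIP_GRID = 2                 # chips per side (assumed)
CORE_GRID = 2                 # cores per side within one chip (assumed)
SIDE = CHIP_GRID * CORE_GRID  # cores per side of the whole system
ON_CHIP_HOP = 1.0             # assumed cost of a hop inside a chip
OFF_CHIP_HOP = 10.0           # assumed cost of a hop crossing a chip boundary


def axis_cost(p, q):
    """Cost of travelling from coordinate p to q along one axis."""
    hops = abs(p - q)
    crossings = abs(p // CORE_GRID - q // CORE_GRID)  # chip boundaries crossed
    return (hops - crossings) * ON_CHIP_HOP + crossings * OFF_CHIP_HOP


def comm_cost(slot_a, slot_b):
    """Cost between two physical slots, numbered row-major on the global grid."""
    a_row, a_col = divmod(slot_a, SIDE)
    b_row, b_col = divmod(slot_b, SIDE)
    return axis_cost(a_row, b_row) + axis_cost(a_col, b_col)


def total_cost(placement, traffic):
    """placement[i] is the slot of logical core i; traffic is (src, dst, volume)."""
    return sum(v * comm_cost(placement[s], placement[d]) for s, d, v in traffic)


def simulated_annealing(traffic, n, steps=5000, t0=10.0, seed=0):
    """Swap-based annealing that starts from the naive sequential placement."""
    rng = random.Random(seed)
    placement = list(range(n))
    cost = total_cost(placement, traffic)
    best, best_cost = placement[:], cost
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-9  # linear cooling schedule
        i, j = rng.sample(range(n), 2)
        placement[i], placement[j] = placement[j], placement[i]
        new_cost = total_cost(placement, traffic)
        # Accept improvements always, and worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = placement[:], cost
        else:
            placement[i], placement[j] = placement[j], placement[i]  # undo swap
    return best, best_cost


if __name__ == "__main__":
    n = SIDE * SIDE
    # Assumed workload: logical core i sends one unit of traffic to core i + 1.
    traffic = [(i, i + 1, 1.0) for i in range(n - 1)]
    print("placements to search:", math.factorial(n))
    print("sequential cost:", total_cost(list(range(n)), traffic))
    print("annealed cost:", simulated_annealing(traffic, n)[1])
```

Even in this 16-core toy there are 16! = 20,922,789,888,000 possible placements, which illustrates the factorial growth the abstract refers to. Because the annealer starts from the sequential placement and tracks the best state it has seen, its result is never worse than the sequential baseline under this cost model.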

