Impact of the Array Shape and Memory Bandwidth on the Execution Time of CNN Systolic Arrays

Eduardo Yago, Pau Castelló, S. Petit, M. E. Gómez, J. Sahuquillo
DOI: 10.1109/DSD51259.2020.00086
Published in: 2020 23rd Euromicro Conference on Digital System Design (DSD), August 2020
Citations: 3

Abstract

The use of Convolutional Neural Networks (CNNs) has risen sharply in recent years, driven mainly by applications in image recognition and other artificial-intelligence tasks. These new CNN applications impose computing demands that conventional processors struggle to meet. As a consequence, accelerators focusing on CNN computation, both prototypes and commercial products, have been proposed. Among these accelerators, those based on systolic arrays have acquired special relevance; examples include Google's TPU and Eyeriss. Current research has focused on regular square systolic arrays, and most existing work assumes that there is enough memory bandwidth to feed the systolic array with input data. In this paper we explore the design of non-square systolic arrays and study the impact of memory bandwidth from a performance perspective. This work makes two main contributions. First, we found that for some workloads non-square arrays achieve performance similar to that of systolic arrays twice as large, which can translate into area and/or energy benefits. Second, we present a performance comparison varying the main memory bandwidth across current DRAM devices. The analysis reveals that main memory bandwidth has a great impact on performance and that the choice of memory technology is key to system performance. For the 64x64 array size, HBM2 memory is necessary to avoid the slowdown that cheaper technologies (e.g., DDR5 and DDR4) would introduce.
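The bandwidth argument in the abstract can be illustrated with a back-of-the-envelope calculation: a systolic array that streams one operand per edge row and column per cycle needs memory bandwidth proportional to its perimeter. The sketch below is illustrative only; the clock frequency (700 MHz), 8-bit operands, the one-operand-per-edge-per-cycle dataflow, and the DRAM peak figures are assumptions, not values taken from the paper.

```python
# Back-of-the-envelope estimate of the input bandwidth a systolic array
# demands, compared against rough peak bandwidths of common DRAM types.
# Clock, operand width, dataflow, and DRAM figures are all assumptions.

def required_bandwidth_gbs(rows, cols, clock_hz=700e6, operand_bytes=1):
    """Bytes per second streamed in along both edges of the array."""
    operands_per_cycle = rows + cols  # one per row (inputs) + one per column (weights)
    return operands_per_cycle * operand_bytes * clock_hz / 1e9

# Approximate peak bandwidths (GB/s) for a single device/stack.
dram_peak = {"DDR4-3200": 25.6, "DDR5-4800": 38.4, "HBM2": 256.0}

for shape in [(32, 32), (64, 64), (128, 32)]:
    need = required_bandwidth_gbs(*shape)
    feasible = [name for name, bw in dram_peak.items() if bw >= need]
    print(f"{shape[0]}x{shape[1]}: needs ~{need:.1f} GB/s; feasible: {feasible}")
```

Under these assumptions a 64x64 array needs roughly 90 GB/s of input bandwidth, which only HBM2 sustains among the listed technologies, consistent in spirit with the paper's conclusion for that array size.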