Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Pub Date: 2017-04-01. DOI: 10.1109/FCCM.2017.47

Yongming Shen, M. Ferdman, Peter Milder


Citations: 95

Abstract

Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. Interest in CNNs has led to the design of CNN accelerators to improve CNN evaluation throughput and efficiency. Importantly, the bandwidth demand from weight data transfer for modern large CNNs causes CNN accelerators to be severely bandwidth-bottlenecked, prompting the need for processing images in batches to increase weight reuse. However, existing CNN accelerator designs limit the choice of batch sizes and lack support for batch processing of convolutional layers. We observe that, for a given storage budget, choosing the best batch size requires balancing the input and weight transfer. We propose Escher, a CNN accelerator with a flexible data buffering scheme that ensures a balance between the input and weight transfer bandwidth, significantly reducing overall bandwidth requirements. For example, compared to the state-of-the-art CNN accelerator designs targeting a Virtex-7 690T FPGA, Escher reduces the accelerator peak bandwidth requirements by 2.4x across both fully-connected and convolutional layers on fixed-point AlexNet, and reduces convolutional layer bandwidth by up to 10.5x on fixed-point GoogLeNet.
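
The observation that the best batch size balances input and weight transfer can be illustrated with a toy model. The sketch below is not Escher's buffering scheme or cost model; it assumes a single fully-connected layer with N inputs and M outputs, an on-chip budget of S words used only for partial sums (a batch of b images times a tile of m outputs, so b * m <= S), and it counts off-chip traffic in words. The function names and the example sizes are hypothetical.

```python
import math


def transfer_per_image(n_in, n_out, budget, batch):
    """Off-chip words per image for a toy fully-connected layer.

    Assumed model (not taken from the paper): partial sums for
    `batch` images x `tile` outputs stay on chip, with
    batch * tile <= budget. For each output tile, all inputs of the
    batch and the weights of that tile are streamed in once.
    """
    tile = budget // batch              # outputs kept on chip per image
    tiles = math.ceil(n_out / tile)     # number of passes over the inputs
    weight_words = n_in * n_out / batch  # each weight is loaded once per batch
    input_words = n_in * tiles           # each image's inputs are re-read per tile
    return weight_words, input_words


def best_batch_size(n_in, n_out, budget):
    """Scan all feasible batch sizes and return the one with least traffic."""
    return min(
        range(1, budget + 1),
        key=lambda b: sum(transfer_per_image(n_in, n_out, budget, b)),
    )


if __name__ == "__main__":
    n_in, n_out, budget = 4096, 4096, 1024
    for b in (1, 16, 32, 64, 1024):
        w, i = transfer_per_image(n_in, n_out, budget, b)
        print(f"batch={b:5d}  weight={w:12.0f}  input={i:12.0f}  total={w + i:12.0f}")
    print("best batch size:", best_batch_size(n_in, n_out, budget))
```

In this model, a small batch is dominated by weight traffic and a large batch by input re-reads; total traffic per image is roughly N * M * (1/b + b/S), which is smallest near b = sqrt(S), where the two terms are equal.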

