{"title":"卷积神经网络以内存为中心的加速器设计","authors":"Maurice Peemen, A. Setio, B. Mesman, H. Corporaal","doi":"10.1109/ICCD.2013.6657019","DOIUrl":null,"url":null,"abstract":"In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should be local on embedded computer platforms with performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNN) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem for the design of efficient accelerators is the limited amount of external memory bandwidth. We show that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workload. The efficiency of the on-chip memories is maximized by our scheduler that uses tiling to optimize for data locality. Our design flow ensures that on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated by a High Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories the FPGA resources can be reduced up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used our accelerators are up to 11× faster.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"274","resultStr":"{\"title\":\"Memory-centric accelerator design for Convolutional Neural Networks\",\"authors\":\"Maurice Peemen, A. Setio, B. Mesman, H. Corporaal\",\"doi\":\"10.1109/ICCD.2013.6657019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should be local on embedded computer platforms with performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNN) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem for the design of efficient accelerators is the limited amount of external memory bandwidth. We show that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workload. The efficiency of the on-chip memories is maximized by our scheduler that uses tiling to optimize for data locality. Our design flow ensures that on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated by a High Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories the FPGA resources can be reduced up to 13× while maintaining the same performance. 
Alternatively, when the same amount of FPGA resources is used our accelerators are up to 11× faster.\",\"PeriodicalId\":398811,\"journal\":{\"name\":\"2013 IEEE 31st International Conference on Computer Design (ICCD)\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"274\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE 31st International Conference on Computer Design (ICCD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCD.2013.6657019\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 31st International Conference on Computer Design (ICCD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCD.2013.6657019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Memory-centric accelerator design for Convolutional Neural Networks
In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should run locally on embedded platforms that combine strict performance requirements with tight energy constraints. Dedicated accelerators for Convolutional Neural Networks (CNNs) can meet these targets while remaining flexible enough to perform multiple vision tasks. A challenging problem in the design of efficient accelerators is the limited external memory bandwidth. We show that the effects of this memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns of CNN workloads. The efficiency of the on-chip memories is maximized by our scheduler, which uses loop tiling to optimize for data locality. Our design flow ensures that the on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated with a High-Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories, the FPGA resources can be reduced by up to 13× while maintaining the same performance. Alternatively, for the same amount of FPGA resources, our accelerators are up to 11× faster.
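To make the tiling idea concrete, below is a minimal sketch in C of a tiled convolution loop nest for a single layer. The layer dimensions, tile sizes, and the `conv_tiled` function are hypothetical illustrations of the general loop-tiling technique the abstract refers to, not the paper's actual accelerator schedule; the point is that the outer tile loops bound the working set so each input window can be served from a small on-chip buffer instead of being refetched from external memory.

```c
/* Sketch of loop tiling for one CNN convolution layer.
 * All sizes below are hypothetical, chosen for illustration only. */
#include <stddef.h>

#define H  32   /* output feature map height (hypothetical) */
#define W  32   /* output feature map width  (hypothetical) */
#define K  5    /* convolution kernel size   (hypothetical) */
#define TY 8    /* tile height: output rows per on-chip block */
#define TX 8    /* tile width:  output cols per on-chip block */

void conv_tiled(const float in[H + K - 1][W + K - 1],
                const float coef[K][K],
                float out[H][W])
{
    /* Outer loops step over output tiles. Within one tile, the
     * (TY+K-1) x (TX+K-1) input window is reused for all TY*TX
     * outputs, so it fits in a small on-chip buffer rather than
     * costing external-memory bandwidth per output pixel. */
    for (size_t ty = 0; ty < H; ty += TY)
        for (size_t tx = 0; tx < W; tx += TX)
            for (size_t y = ty; y < ty + TY; y++)
                for (size_t x = tx; x < tx + TX; x++) {
                    float acc = 0.0f;
                    for (size_t ky = 0; ky < K; ky++)
                        for (size_t kx = 0; kx < K; kx++)
                            acc += coef[ky][kx] * in[y + ky][x + kx];
                    out[y][x] = acc;
                }
}
```

Choosing TY and TX trades on-chip buffer size against input reuse: larger tiles reuse each fetched input pixel across more outputs, which is the locality the paper's scheduler optimizes when sizing the memory hierarchy.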