A Massively Parallel Coprocessor for Convolutional Neural Networks

M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf
{"title":"卷积神经网络的大规模并行协处理器","authors":"M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf","doi":"10.1109/ASAP.2009.25","DOIUrl":null,"url":null,"abstract":"We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled to. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and leverage the algorithm’s simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1GB. The coprocessor prototype can process at the rate of 3.4 billion multiply accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"224","resultStr":"{\"title\":\"A Massively Parallel Coprocessor for Convolutional Neural Networks\",\"authors\":\"M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf\",\"doi\":\"10.1109/ASAP.2009.25\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled to. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and leverage the algorithm’s simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1GB. 
The coprocessor prototype can process at the rate of 3.4 billion multiply accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.\",\"PeriodicalId\":202421,\"journal\":{\"name\":\"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-07-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"224\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASAP.2009.25\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASAP.2009.25","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 224

Abstract

We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing the sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data, further increase the effective memory bandwidth by packing multiple words into every memory operation, and leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor hardware primitives along with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and four DDR2 memory banks totaling 1 GB. The coprocessor prototype processes CNN forward propagation at 3.4 billion multiply-accumulates per second (GMACs), 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application, with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
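The "meta-operator" the abstract refers to chains a 2D convolution, sub-sampling, and a non-linearity. As a rough illustration of the computation the coprocessor primitives implement, here is a minimal C sketch for a single feature map; the 2x2 average sub-sampling, tanh non-linearity, kernel size, and all identifiers are assumptions chosen for illustration, not details taken from the paper, and the hardware evaluates the convolution in parallel rather than serially as here.

```c
#include <math.h>
#include <stdio.h>

#define IN  8                 /* input size (illustrative)      */
#define K   3                 /* kernel size (illustrative)     */
#define OUT (IN - K + 1)      /* valid-convolution output size  */
#define SUB (OUT / 2)         /* size after 2x2 sub-sampling    */

/* One application of the "meta-operator": 2D convolution, 2x2
 * average sub-sampling, then a tanh non-linearity. Serial
 * single-feature-map reference only. */
void cnn_layer(float in[IN][IN], float kern[K][K],
               float bias, float out[SUB][SUB])
{
    float conv[OUT][OUT];

    /* 2D convolution (valid region, unit stride). */
    for (int y = 0; y < OUT; y++)
        for (int x = 0; x < OUT; x++) {
            float acc = bias;
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += in[y + ky][x + kx] * kern[ky][kx];
            conv[y][x] = acc;
        }

    /* 2x2 average sub-sampling followed by the non-linearity. */
    for (int y = 0; y < SUB; y++)
        for (int x = 0; x < SUB; x++) {
            float s = conv[2*y][2*x]     + conv[2*y][2*x + 1]
                    + conv[2*y + 1][2*x] + conv[2*y + 1][2*x + 1];
            out[y][x] = tanhf(s / 4.0f);
        }
}

int main(void)
{
    float in[IN][IN] = {{0}}, kern[K][K] = {{0}}, out[SUB][SUB];
    in[3][3] = 1.0f;      /* impulse input                       */
    kern[1][1] = 1.0f;    /* delta kernel: passes input through  */
    cnn_layer(in, kern, 0.0f, out);
    printf("out[1][1] = %f\n", out[1][1]);  /* tanh(0.25) ~ 0.2449 */
    return 0;
}
```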
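The bandwidth-packing idea, several low-precision operands carried per physical memory word, can be sketched as follows. The 16-bit operand width and 64-bit memory word are assumptions for illustration; the paper does not specify these widths in the abstract.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four low-precision operands (here: 16-bit fixed point,
 * an assumed width) into one 64-bit memory word, so a single
 * memory transaction carries four values instead of one. */
uint64_t pack4(const int16_t v[4])
{
    return  (uint64_t)(uint16_t)v[0]
         | ((uint64_t)(uint16_t)v[1] << 16)
         | ((uint64_t)(uint16_t)v[2] << 32)
         | ((uint64_t)(uint16_t)v[3] << 48);
}

/* Recover the four operands from a packed memory word. */
void unpack4(uint64_t w, int16_t v[4])
{
    for (int i = 0; i < 4; i++)
        v[i] = (int16_t)(uint16_t)(w >> (16 * i));
}

int main(void)
{
    int16_t v[4] = { -3, 7, -32768, 32767 }, r[4];
    unpack4(pack4(v), r);            /* round trip */
    for (int i = 0; i < 4; i++)
        assert(v[i] == r[i]);
    return 0;
}
```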
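As a sanity check on the reported figures: a 31x speedup at 3.4 GMACs implies the software baseline on the 2.2 GHz Opteron sustains roughly 3.4 / 31 ≈ 0.11 GMACs, i.e. about 0.05 multiply-accumulates per clock cycle.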