Automatic Generation of Multi-Precision Multi-Arithmetic CNN Accelerators for FPGAs

2019 International Conference on Field-Programmable Technology (ICFPT). Pub Date: 2019-10-21. DOI: 10.1109/icfpt47387.2019.00014

Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, R. Mullins, P. Cheung, G. Constantinides, Chengzhong Xu


Citations: 23

Abstract

Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework that automates the generation of efficient CNN accelerators. The generated design is pipelined, and each convolution layer can use a different arithmetic at its own precision. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic auto-generation framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with minimal loss in classification accuracy. The fine-tuned parameters are then combined with templated hardware designs to automatically produce efficient inference circuits on FPGAs. We demonstrate that our approach significantly reduces model size and computational complexity and, for the first time, allows a complete ImageNet network to fit on a single FPGA without accessing off-chip memory. Furthermore, we show how Tomato produces implementations of networks of various sizes running on single or multiple FPGAs. To the best of our knowledge, our automatically generated accelerators outperform the closest FPGA-based competitors by at least 2-4× in latency and throughput; the generated accelerator runs ImageNet classification at more than 3000 frames per second.
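To make the two weight formats named in the abstract concrete, the following is a minimal sketch of rounding a weight to a signed fixed-point value and to a signed power of 2, plus the shift that replaces a multiplication in the power-of-2 case. It is not Tomato's actual fine-tuning procedure, which the abstract does not specify. The function names, the 8-bit width with 6 fractional bits, and the exponent range [-7, 0] are all assumptions chosen for illustration.

```python
import math


def quantize_fixed_point(w, total_bits=8, frac_bits=6):
    """Round w to a signed fixed-point value (bit widths are assumed, not from the paper)."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))      # most negative representable integer code
    hi = (1 << (total_bits - 1)) - 1   # most positive representable integer code
    code = max(lo, min(hi, round(w * scale)))  # round, then saturate
    return code / scale


def quantize_power_of_two(w, min_exp=-7, max_exp=0):
    """Round w to sign(w) * 2**e, with e rounded in the log domain and clamped.

    Zero stays zero; nonzero weights below 2**min_exp are clamped up to it.
    The exponent range is an assumption for this sketch.
    """
    if w == 0.0:
        return 0.0
    e = round(math.log2(abs(w)))
    e = max(min_exp, min(max_exp, e))
    return math.copysign(2.0 ** e, w)


def multiply_by_shift(activation, exponent):
    """Multiply an integer activation by 2**exponent using only shifts."""
    if exponent >= 0:
        return activation << exponent
    return activation >> -exponent  # arithmetic right shift; rounds toward -inf


if __name__ == "__main__":
    for w in (0.3, -0.7, 0.0, 5.0):
        print(w, quantize_fixed_point(w), quantize_power_of_two(w))
    # A weight of 2**-2 applied to activation 40 needs no multiplier.
    print(multiply_by_shift(40, -2))
```

The appeal of the power-of-2 format in hardware is that each weight is stored as a sign and a short exponent, and the multiply becomes a shift, while fixed-point weights keep finer resolution where a layer needs it. A per-layer choice between the two is what "multi-arithmetic" refers to in the abstract.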



