DYNAMAP: Dynamic Algorithm Mapping Framework for Low Latency CNN Inference

The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2020-12-02. DOI: 10.1145/3431920.3439286

Yuan Meng, S. Kuppannagari, R. Kannan, V. Prasanna


Citations: 14

Abstract

Most existing work on FPGA acceleration of Convolutional Neural Networks (CNNs) employs a single strategy (algorithm, dataflow, etc.) across all layers. Such an approach does not achieve optimal latency on complex and deep CNNs. Emerging CNNs have diverse per-layer computation characteristics, including parallelism, arithmetic intensity, locality, and memory footprint. Per-layer strategy selection and fine-grained tuning are required to achieve low end-to-end latency. However, specialized hardware modules dedicated to each layer limit per-layer utilization and adversely affect end-to-end latency. In this paper, we address these problems with an algorithm-architecture co-optimization framework, DYNAMAP, consisting of (1) a unified hardware overlay that can be reused across layers, supporting dynamic mapping of all three families of popular convolution algorithms and further allowing flexible dataflow switching to maximize hardware utilization for each layer; and (2) a novel software Design Space Exploration (DSE) flow that customizes the hardware overlay and chooses the optimal strategy mapping. We show that the algorithm mapping space grows exponentially with network depth. Although the optimal algorithm selection problem is NP-hard in general, we exploit the series-parallel structure of CNN models to demonstrate a polynomial-time solution for optimal algorithm mapping. DYNAMAP is optimized for any CNN, including those having diverse computation and memory requirements across the layers. We demonstrate DYNAMAP using two state-of-the-art CNNs: GoogleNet and Inception-V4. The generated accelerators achieve up to 2.8x and 1.4x speedups, respectively, in inference latency compared with state-of-the-art FPGA implementations.
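To make the polynomial-time claim concrete, the following is a minimal sketch of per-layer algorithm selection on a purely sequential chain of layers, which is the simplest series-parallel graph. It is not the authors' DSE flow: the algorithm names, the latency numbers, and the `switch_cost` function are all hypothetical placeholders, since the abstract does not name the three algorithm families or give any cost model. With A algorithms and L layers there are A^L possible mappings, but a dynamic program over the chain finds the minimum-latency one in O(L * A^2) steps.

```python
from typing import Callable, Dict, List, Tuple


def best_chain_mapping(
    layer_cost: List[Dict[str, float]],
    switch_cost: Callable[[str, str], float],
) -> Tuple[float, List[str]]:
    """Return (minimum total latency, algorithm chosen for each layer).

    layer_cost[i][a] is the assumed latency of layer i under algorithm a.
    switch_cost(p, a) is the assumed overhead of feeding the output of a
    layer run with algorithm p into a layer run with algorithm a.
    """
    # best[a] = cheapest latency of the prefix whose last layer uses a
    best = dict(layer_cost[0])
    back: List[Dict[str, str]] = []  # back[i][a] = best predecessor of a

    for costs in layer_cost[1:]:
        new_best: Dict[str, float] = {}
        choice: Dict[str, str] = {}
        for algo, cost in costs.items():
            prev = min(best, key=lambda p: best[p] + switch_cost(p, algo))
            new_best[algo] = best[prev] + switch_cost(prev, algo) + cost
            choice[algo] = prev
        best = new_best
        back.append(choice)

    # Walk the back-pointers from the cheapest final state.
    last = min(best, key=best.get)
    total = best[last]
    path = [last]
    for choice in reversed(back):
        path.append(choice[path[-1]])
    path.reverse()
    return total, path


# Hypothetical per-layer latencies (arbitrary units) for three placeholder
# algorithm names; none of these numbers come from the paper.
example_costs = [
    {"im2col": 4.0, "kn2row": 5.0, "winograd": 3.0},
    {"im2col": 2.0, "kn2row": 1.0, "winograd": 4.0},
    {"im2col": 3.0, "kn2row": 3.5, "winograd": 1.5},
]


def example_switch(prev: str, cur: str) -> float:
    # Assumed flat penalty for changing the data layout between layers.
    return 0.0 if prev == cur else 1.0


total, path = best_chain_mapping(example_costs, example_switch)
print(total, path)
```

In this toy instance, using one algorithm for every layer costs at least 8.5 units, while the mixed mapping found by the dynamic program costs 7.5 even after paying two switching penalties. Handling parallel branches (as in Inception modules) would require extending the recurrence to merge branch results, which this sketch does not attempt.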
