Specializing for Efficiency: Customizing AI Inference Processors on FPGAs

2021 International Conference on Microelectronics (ICM). Pub Date: 2021-12-19. DOI: 10.1109/ICM52667.2021.9664938

Andrew Boutros, E. Nurvitadhi, Vaughn Betz


Citations: 4

Abstract

Artificial intelligence (AI) has become an essential component in modern datacenter applications. The high computational complexity of AI algorithms and the stringent latency constraints for datacenter workloads necessitate the use of efficient specialized AI accelerators. However, the rapid changes in state-of-the-art AI algorithms as well as their varying compute and memory demands challenge accelerator deployments in datacenters as a result of the much slower hardware development cycle. To this end, field-programmable gate arrays (FPGAs) offer the necessary adaptability along with the desired custom hardware efficiency. However, FPGA design is non-trivial; it requires deep hardware expertise and suffers from long compile and debug times, making FPGAs difficult to use for software-oriented AI application developers. AI inference soft processor overlays address this by allowing application developers to write their AI algorithms in a high-level programming language, which are then compiled into instructions to be executed on an AI-targeted soft processor implemented on the FPGA. While the generality of such overlays can eliminate the long bitstream compile times and make FPGAs more accessible for application developers, some classes of the target workloads do not fully utilize the overlay resources, resulting in sub-optimal efficiency. In this paper, we investigate the trade-off between hardware efficiency and designer productivity by quantifying the gains and costs of specializing overlays for different classes of AI workloads. We show that per-workload specialized variants of the neural processing unit (NPU), a state-of-the-art AI inference overlay, can achieve up to 41% better performance and 44% area savings.

