A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC

Jun-Seok Park, Changsoo Park, S. Kwon, Hyeong-Seok Kim, Taeho Jeon, Yesung Kang, Heonsoo Lee, Dongwoo Lee, James Kim, YoungJong Lee, Sangkyu Park, Jun-Woo Jang, Sanghyuck Ha, MinSeong Kim, Jihoon Bang, Sukhwan Lim, Inyup Kang
{"title":"A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC","authors":"Jun-Seok Park, Changsoo Park, S. Kwon, Hyeong-Seok Kim, Taeho Jeon, Yesung Kang, Heonsoo Lee, Dongwoo Lee, James Kim, YoungJong Lee, Sangkyu Park, Jun-Woo Jang, Sanghyuck Ha, MinSeong Kim, Jihoon Bang, Sukhwan Lim, Inyup Kang","doi":"10.1109/ISSCC42614.2022.9731639","DOIUrl":null,"url":null,"abstract":"Recent work on neural-network accelerators has focused on obtaining high performance in order to meet the needs of real-time applications with vastly different performance requirements, including high precision computation, efficiency for various Deep Learning (DL) layer types, and extremely low power to run always-on applications. Applying a single mode or datatype uniformly across these different scenarios would be less efficient than using different operating modes according to different operating scenarios. For example, super-resolution typically requires FP16 precision for higher image quality, while NNs for face-detection need only INT4 or INT8 precision. Using higher precision than INT8 for face detection would result in higher power consumption. A highly programmable NPU capable of covering the diverse workloads observed in the real world is therefore desired. In this paper, we present a neural processing unit (NPU) optimized with the following features: i) reconfigurable data prefetching and operational flow for high compute utilization, ii) multi-precision MACs supporting INT4,8,16, and float16, iii) a dynamic operation mode to cover extremely low-power or low-latency requirements. These features provide the flexibility needed by real world applications within the power constraints of various product domains.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"9 1","pages":"246-248"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC42614.2022.9731639","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Recent work on neural-network accelerators has focused on obtaining high performance in order to meet the needs of real-time applications with vastly different requirements, including high-precision computation, efficiency across various Deep Learning (DL) layer types, and extremely low power for always-on applications. Applying a single operating mode or datatype uniformly across these scenarios is less efficient than selecting the mode best suited to each. For example, super-resolution typically requires FP16 precision for higher image quality, while NNs for face detection need only INT4 or INT8 precision; using precision higher than INT8 for face detection would only increase power consumption. A highly programmable NPU capable of covering the diverse workloads observed in the real world is therefore desired. In this paper, we present a neural processing unit (NPU) optimized with the following features: i) reconfigurable data prefetching and operational flow for high compute utilization, ii) multi-precision MACs supporting INT4/8/16 and FP16, and iii) a dynamic operation mode to cover extremely low-power or low-latency requirements. These features provide the flexibility needed by real-world applications within the power constraints of various product domains.
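The paper does not disclose its MAC circuit, but a unified multi-precision datapath of this kind is commonly built from small sub-word multipliers that are recombined for wider operands. The C sketch below is a minimal illustration of that idea, not the authors' implementation: mul4 stands in for a hypothetical 4-bit hardware multiplier primitive, and mul8_from_mul4 assembles a signed 8b x 8b product from four such partial products (it assumes arithmetic right shift on signed values, which mainstream compilers provide).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 4b x 4b primitive: the only "hardware" multiplier assumed.
   Operands are nibble-sized signed/unsigned values; the product is small. */
static int32_t mul4(int32_t x, int32_t y) { return x * y; }

/* Signed 8b x 8b multiply built from four 4b partial products, mirroring
   the sub-word decomposition a unified multi-precision datapath could use:
   a = a_hi*16 + a_lo, b = b_hi*16 + b_lo, so
   a*b = (a_hi*b_hi << 8) + ((a_hi*b_lo + a_lo*b_hi) << 4) + a_lo*b_lo. */
static int32_t mul8_from_mul4(int8_t a, int8_t b)
{
    int32_t a_hi = a >> 4;   /* signed upper nibble (arithmetic shift) */
    int32_t a_lo = a & 0xF;  /* unsigned lower nibble                  */
    int32_t b_hi = b >> 4;
    int32_t b_lo = b & 0xF;
    return (mul4(a_hi, b_hi) << 8)
         + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
         +  mul4(a_lo, b_lo);
}

int main(void)
{
    /* Exhaustive sanity check against a native multiply. */
    for (int a = -128; a < 128; a++)
        for (int b = -128; b < 128; b++)
            if (mul8_from_mul4((int8_t)a, (int8_t)b) != a * b) {
                printf("mismatch at %d * %d\n", a, b);
                return 1;
            }
    printf("all 8b x 8b products match\n");
    return 0;
}
```

Under such a scheme the same multiplier array can complete four INT4 products or one INT8 product per cycle, and INT16 can be composed from 8-bit products in the same recursive fashion; that is one plausible reading of how a single 8K-MAC datapath covers INT4/8/16, with FP16 additionally reusing the integer multipliers for the mantissa product.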