A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC

Jun-Seok Park, Changsoo Park, S. Kwon, Hyeong-Seok Kim, Taeho Jeon, Yesung Kang, Heonsoo Lee, Dongwoo Lee, James Kim, YoungJong Lee, Sangkyu Park, Jun-Woo Jang, Sanghyuck Ha, MinSeong Kim, Jihoon Bang, Sukhwan Lim, Inyup Kang
{"title":"A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC","authors":"Jun-Seok Park, Changsoo Park, S. Kwon, Hyeong-Seok Kim, Taeho Jeon, Yesung Kang, Heonsoo Lee, Dongwoo Lee, James Kim, YoungJong Lee, Sangkyu Park, Jun-Woo Jang, Sanghyuck Ha, MinSeong Kim, Jihoon Bang, Sukhwan Lim, Inyup Kang","doi":"10.1109/ISSCC42614.2022.9731639","DOIUrl":null,"url":null,"abstract":"Recent work on neural-network accelerators has focused on obtaining high performance in order to meet the needs of real-time applications with vastly different performance requirements, including high precision computation, efficiency for various Deep Learning (DL) layer types, and extremely low power to run always-on applications. Applying a single mode or datatype uniformly across these different scenarios would be less efficient than using different operating modes according to different operating scenarios. For example, super-resolution typically requires FP16 precision for higher image quality, while NNs for face-detection need only INT4 or INT8 precision. Using higher precision than INT8 for face detection would result in higher power consumption. A highly programmable NPU capable of covering the diverse workloads observed in the real world is therefore desired. In this paper, we present a neural processing unit (NPU) optimized with the following features: i) reconfigurable data prefetching and operational flow for high compute utilization, ii) multi-precision MACs supporting INT4,8,16, and float16, iii) a dynamic operation mode to cover extremely low-power or low-latency requirements. These features provide the flexibility needed by real world applications within the power constraints of various product domains.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"9 1","pages":"246-248"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC42614.2022.9731639","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Recent work on neural-network accelerators has focused on obtaining high performance in order to meet the needs of real-time applications with vastly different requirements, including high-precision computation, efficiency across various Deep Learning (DL) layer types, and extremely low power for always-on applications. Applying a single operating mode or datatype uniformly across these scenarios is less efficient than selecting the mode best suited to each. For example, super-resolution typically requires FP16 precision for higher image quality, while NNs for face detection need only INT4 or INT8 precision; using precision higher than INT8 for face detection would only increase power consumption. A highly programmable NPU capable of covering the diverse workloads observed in the real world is therefore desired. In this paper, we present a neural processing unit (NPU) optimized with the following features: i) reconfigurable data prefetching and operational flow for high compute utilization, ii) multi-precision MACs supporting INT4/8/16 and FP16, and iii) a dynamic operation mode to cover extremely low-power or low-latency requirements. These features provide the flexibility needed by real-world applications within the power constraints of various product domains.
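The paper does not disclose its MAC circuit, but a unified multi-precision datapath of this kind is commonly built from small sub-word multipliers that are recombined for wider operands. The C sketch below is a minimal illustration of that idea, not the authors' implementation: mul4 stands in for a hypothetical 4-bit hardware multiplier primitive, and mul8_from_mul4 assembles a signed 8b x 8b product from four such partial products (it assumes arithmetic right shift on signed values, which mainstream compilers provide).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 4b x 4b primitive: the only "hardware" multiplier assumed.
   Operands are nibble-sized signed/unsigned values; the product is small. */
static int32_t mul4(int32_t x, int32_t y) { return x * y; }

/* Signed 8b x 8b multiply built from four 4b partial products, mirroring
   the sub-word decomposition a unified multi-precision datapath could use:
   a = a_hi*16 + a_lo, b = b_hi*16 + b_lo, so
   a*b = (a_hi*b_hi << 8) + ((a_hi*b_lo + a_lo*b_hi) << 4) + a_lo*b_lo. */
static int32_t mul8_from_mul4(int8_t a, int8_t b)
{
    int32_t a_hi = a >> 4;   /* signed upper nibble (arithmetic shift) */
    int32_t a_lo = a & 0xF;  /* unsigned lower nibble                  */
    int32_t b_hi = b >> 4;
    int32_t b_lo = b & 0xF;
    return (mul4(a_hi, b_hi) << 8)
         + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
         +  mul4(a_lo, b_lo);
}

int main(void)
{
    /* Exhaustive sanity check against a native multiply. */
    for (int a = -128; a < 128; a++)
        for (int b = -128; b < 128; b++)
            if (mul8_from_mul4((int8_t)a, (int8_t)b) != a * b) {
                printf("mismatch at %d * %d\n", a, b);
                return 1;
            }
    printf("all 8b x 8b products match\n");
    return 0;
}
```

Under such a scheme the same multiplier array can complete four INT4 products or one INT8 product per cycle, and INT16 can be composed from 8-bit products in the same recursive fashion; that is one plausible reading of how a single 8K-MAC datapath covers INT4/8/16, with FP16 additionally reusing the integer multipliers for the mantissa product.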