A 4.6-8.3 TOPS/W 1.2-4.9 TOPS CNN-based Computational Imaging Processor with Overlapped Stripe Inference Achieving 4K Ultra-HD 30fps

ESSCIRC 2022- IEEE 48th European Solid State Circuits Conference (ESSCIRC) | Pub Date: 2022-09-19 | DOI: 10.1109/ESSCIRC55480.2022.9911515

Yu-Chun Ding, Kai-Pin Lin, Chi-Wen Weng, Li-Wei Wang, Huan-Ching Wang, Chun-Yeh Lin, Yong-Tai Chen, Chao-Tsung Huang


Citations: 1

Abstract


A 4.6-8.3 TOPS/W 1.2-4.9 TOPS CNN-based Computational Imaging Processor with Overlapped Stripe Inference Achieving 4K Ultra-HD 30fps

In this paper, we present an energy-efficient accelerator chip that supports high-quality CNN-based computational imaging applications at 4K Ultra-HD 30fps. To address the heavy demands on DRAM bandwidth and computing energy, we propose an overlapped stripe inference flow and a structure-sparse $\text{CONV}3\times3$ engine, respectively. The former reduces DRAM bandwidth to 0.81-1.74 GB/s when supporting high-quality CNN inference with 16 to 29 layers at 4K Ultra-HD 30fps. The latter reduces computing complexity by 40% without noticeable quality degradation, e.g., a PSNR drop of 0.02-0.03 dB. More specifically, it uses only 4.9 intrinsic TOPS of computing capability at 200 MHz to approach the quality of dense models that demand up to 8.2 TOPS. In addition, a coarse-grained reconfigurable datapath is designed to support diverse applications, including super-resolution, denoising, and style transfer, with high hardware efficiency. Fabricated in 40nm CMOS, this chip achieves an energy efficiency of 4.6-8.3 TOPS/W for high-quality computational imaging applications. We also implement an FPGA-aided system to demonstrate real-time processing for the diverse applications supported by the fabricated chip.
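The abstract does not spell out the chip's dataflow, so the following is only a minimal sketch of the general idea behind overlapped stripe inference, under stated assumptions: the frame is cut into vertical stripes, each stripe is widened by a halo of 2 columns per 3x3 layer, and the whole layer stack runs on one stripe at a time so that intermediate feature maps need not be stored for the full frame. The function names (`conv3x3`, `run_layers`, `stripe_inference`, `overlap_overhead`), the toy image, the integer kernels, and the use of unpadded convolutions with ReLU are all illustrative choices, not details from the paper.

```python
def conv3x3(img, kernel):
    """Unpadded ('valid') 3x3 convolution followed by ReLU.

    The output is 2 rows and 2 columns smaller than the input.
    """
    h, w = len(img), len(img[0])
    return [
        [
            max(0, sum(kernel[dy][dx] * img[y + dy][x + dx]
                       for dy in range(3) for dx in range(3)))
            for x in range(w - 2)
        ]
        for y in range(h - 2)
    ]


def run_layers(img, kernels):
    """Apply the layers one after another (full-frame reference path)."""
    for k in kernels:
        img = conv3x3(img, k)
    return img


def stripe_inference(img, kernels, stripe_width):
    """Run the same layers stripe by stripe with overlapping input columns.

    Each output stripe of width S needs S + 2 * num_layers input columns,
    so neighbouring input stripes overlap by 2 * num_layers columns.
    """
    halo = 2 * len(kernels)
    out_w = len(img[0]) - halo
    result = None
    for c0 in range(0, out_w, stripe_width):
        c1 = min(c0 + stripe_width, out_w)
        stripe = [row[c0:c1 + halo] for row in img]   # input stripe + halo
        out = run_layers(stripe, kernels)             # width is c1 - c0
        if result is None:
            result = out
        else:
            result = [a + b for a, b in zip(result, out)]  # append columns
    return result


def overlap_overhead(stripe_width, num_layers):
    """Ratio of input columns read per stripe to output columns produced."""
    return (stripe_width + 2 * num_layers) / stripe_width


# Toy 12 x 40 integer image and three hypothetical 3x3 kernels.
img = [[(7 * x + 13 * y + x * y) % 17 - 8 for x in range(40)] for y in range(12)]
kernels = [
    [[1, 0, -1], [2, 0, -2], [1, 0, -1]],
    [[0, 1, 0], [1, -4, 1], [0, 1, 0]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]

full = run_layers(img, kernels)
striped = stripe_inference(img, kernels, stripe_width=8)
assert striped == full  # stripe-wise result matches full-frame inference
```

In this sketch the cost of working stripe by stripe is the recomputed halo: narrower stripes need less on-chip buffering but read and recompute proportionally more overlapping columns, which is what `overlap_overhead` measures. How the fabricated chip balances stripe width, network depth, and DRAM traffic is not stated in the abstract.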
