CNN Accelerator with Minimal On-Chip Memory Based on Hierarchical Array

Hyun-Wook Son, YongSeok Na, TaeHyun Kim, Ali A. Al-Hamid, Hyungwon Kim
{"title":"基于分层阵列的最小片上存储器CNN加速器","authors":"Hyun-Wook Son, YongSeok Na, TaeHyun Kim, Ali A. Al-Hamid, Hyungwon Kim","doi":"10.1109/ISOCC53507.2021.9613997","DOIUrl":null,"url":null,"abstract":"This paper presents an architecture of CNN accelerator based on a new processing element (PE) array called a diagonal cyclic array. It can significantly reduce the burden of repeated memory accesses for feature data and weight parameters for CNN models. To evaluate the effectiveness of the proposed architecture, we implemented a CNN accelerator for YOLOv4-Tiny consisting of 9 layers. We also present how to optimize the local buffer size with little sacrifice of inference speed. We evaluated the example CNN accelerator using FPGA implementation with 24932 LUTs, 584 DSP blocks and a on-chip memory of only 58KB. It demonstrates an accuracy 58% (mAP0.5) with computation time of 240ms for each input image using a clock speed of 100MHz. This speed is expected to reach 2.4ms using a clock speed of 1GHz, if implemented in a silicon SoC using a sub-micron process.","PeriodicalId":185992,"journal":{"name":"2021 18th International SoC Design Conference (ISOCC)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"CNN Accelerator with Minimal On-Chip Memory Based on Hierarchical Array\",\"authors\":\"Hyun-Wook Son, YongSeok Na, TaeHyun Kim, Ali A. Al-Hamid, Hyungwon Kim\",\"doi\":\"10.1109/ISOCC53507.2021.9613997\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents an architecture of CNN accelerator based on a new processing element (PE) array called a diagonal cyclic array. It can significantly reduce the burden of repeated memory accesses for feature data and weight parameters for CNN models. To evaluate the effectiveness of the proposed architecture, we implemented a CNN accelerator for YOLOv4-Tiny consisting of 9 layers. We also present how to optimize the local buffer size with little sacrifice of inference speed. We evaluated the example CNN accelerator using FPGA implementation with 24932 LUTs, 584 DSP blocks and a on-chip memory of only 58KB. It demonstrates an accuracy 58% (mAP0.5) with computation time of 240ms for each input image using a clock speed of 100MHz. 
This speed is expected to reach 2.4ms using a clock speed of 1GHz, if implemented in a silicon SoC using a sub-micron process.\",\"PeriodicalId\":185992,\"journal\":{\"name\":\"2021 18th International SoC Design Conference (ISOCC)\",\"volume\":\"4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 18th International SoC Design Conference (ISOCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISOCC53507.2021.9613997\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 18th International SoC Design Conference (ISOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISOCC53507.2021.9613997","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

This paper presents a CNN accelerator architecture based on a new processing-element (PE) array called a diagonal cyclic array, which significantly reduces the burden of repeated memory accesses for the feature data and weight parameters of CNN models. To evaluate the effectiveness of the proposed architecture, we implemented a CNN accelerator for YOLOv4-Tiny consisting of 9 layers. We also show how to optimize the local buffer size with little sacrifice of inference speed. We evaluated the example CNN accelerator on an FPGA implementation using 24,932 LUTs, 584 DSP blocks, and an on-chip memory of only 58 KB. It achieves an accuracy of 58% (mAP0.5) with a computation time of 240 ms per input image at a clock speed of 100 MHz. This is expected to drop to 2.4 ms at a clock speed of 1 GHz if implemented as a silicon SoC in a sub-micron process.
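The claim that a cyclic PE array "significantly reduces repeated memory accesses" can be made concrete with a small scheduling sketch. The snippet below is a conceptual Python model only, not the paper's RTL or its exact diagonal dataflow: it maps a toy 1D convolution onto a ring of PEs, keeps each weight resident in one PE, fetches every feature value from memory once and rotates it through the ring, and counts external memory reads against a naive per-MAC schedule. The array size, the 1D case, and all function names (conv1d_reference, cyclic_reuse_schedule, etc.) are illustrative assumptions.

```python
# Conceptual sketch (not the paper's design): compare external memory reads
# for a naive MAC schedule vs. a cyclic-reuse schedule on a ring of PEs.

def conv1d_reference(feature, kernel):
    """Plain 1D valid convolution used as the correctness reference."""
    k = len(kernel)
    return [sum(feature[i + j] * kernel[j] for j in range(k))
            for i in range(len(feature) - k + 1)]

def naive_schedule(feature, kernel):
    """Every MAC re-fetches both of its operands from memory."""
    k, reads, out = len(kernel), 0, []
    for i in range(len(feature) - k + 1):
        acc = 0
        for j in range(k):
            reads += 2                      # one feature read + one weight read
            acc += feature[i + j] * kernel[j]
        out.append(acc)
    return out, reads

def cyclic_reuse_schedule(feature, kernel):
    """K PEs: weights are loaded once and stay resident; each feature value
    is fetched once and then forwarded (cyclically shifted) between PEs."""
    k = len(kernel)
    reads = k                               # load each weight into its PE once
    n_out = len(feature) - k + 1
    acc = [0] * n_out
    for t, x in enumerate(feature):
        reads += 1                          # fetch each feature value once
        # x is shifted through the PEs; PE j applies it to output index t - j
        for j in range(k):
            i = t - j
            if 0 <= i < n_out:
                acc[i] += x * kernel[j]
    return acc, reads

if __name__ == "__main__":
    feature = list(range(1, 33))            # toy 32-element feature row
    kernel = [1, -2, 3]                     # toy 3-tap kernel
    ref = conv1d_reference(feature, kernel)
    out_naive, r_naive = naive_schedule(feature, kernel)
    out_cyclic, r_cyclic = cyclic_reuse_schedule(feature, kernel)
    assert out_naive == ref and out_cyclic == ref
    print(f"naive schedule : {r_naive} memory reads")
    print(f"cyclic reuse   : {r_cyclic} memory reads")
```

For this toy case the naive schedule issues 180 reads while the reuse schedule issues 35, both producing identical outputs; this kind of operand reuse is what allows a hierarchical PE array to get by with a small local buffer instead of repeatedly streaming the same data from external memory.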