Chung-Bin Wu, Y. Hwang, Yu-Cheng Hsueh, Yu-Kuan Hsiao
2020 International SoC Design Conference (ISOCC), published 2020-10-21. DOI: 10.1109/ISOCC50952.2020.9333025 (https://doi.org/10.1109/ISOCC50952.2020.9333025)
High Efficient Bandwidth Utilization Hardware Design and Implement for AI Deep Learning Accelerator
This paper proposes a neural network accelerator for Tiny-Yolo V2. The input feature maps, output feature maps, and weight kernels are converted to uint8 through a quantization strategy, which reduces the data size and makes hardware utilization more efficient. Moreover, we propose an input feature map placement method that reduces bandwidth usage and improves processing element (PE) utilization. The hardware architecture is verified on the Xilinx ZCU102 platform. Synthesis results show that the proposed architecture, implemented in a 90 nm process, achieves 14.4 GOPS at 100 MHz with an area efficiency of 99 GOPS per million gates.
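The abstract does not say which quantization scheme is used, so the following is only a minimal sketch of one common option, affine (asymmetric) uint8 quantization, to illustrate how floating-point weights or feature-map values might be mapped to 8-bit integers. The function names `quantize_uint8` and `dequantize_uint8` and the sample values are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def quantize_uint8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine quantization of floats to the uint8 range [0, 255].

    Assumed scheme (not confirmed by the paper):
        q = clamp(round(v / scale) + zero_point, 0, 255)
    Returns the quantized values, the scale, and the zero point.
    """
    # Extend the range to include 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    quantized = [
        min(255, max(0, int(round(v / scale)) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_uint8(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map uint8 codes back to approximate floating-point values."""
    return [(q - zero_point) * scale for q in quantized]


# Illustrative values only; real weights would come from a trained model.
weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize_uint8(q, scale, zero_point)
```

Each value then occupies one byte instead of the four bytes of a 32-bit float, which is the kind of data-size reduction the abstract refers to. The cost is a rounding error of roughly one quantization step (`scale`) or less per value.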