{"title":"Architecture-Aware Optimization of Layer Fusion for Latency-Optimal CNN Inference","authors":"Minyong Yoon, Jungwook Choi","doi":"10.1109/AICAS57966.2023.10168659","DOIUrl":null,"url":null,"abstract":"Layer fusion is an effective technique for accelerating latency-sensitive CNN inference tasks on resource-constrained accelerators that exploit distributed on-chip integrated memory-accelerator processing-in memory (PIM). However, previous research primarily focused on optimizing memory access, neglecting the significant impact of hardware architecture on latency. This study presents an analytical latency model for a 2D systolic array accelerator, taking into account various hardware factors such as array dimensions, buffer size, and bandwidth. We then investigate the influence of hardware architecture and fusion strategies, including weight and overlap reuse, on performance; these aspects are insufficiently addressed in existing access-based fusion models. By incorporating layer fusion with our proposed latency model across different architectures, dataflows, and workloads, we achieve up to a 53.1% reduction in end-to-end network latency compared to an access-based model.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AICAS57966.2023.10168659","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Layer fusion is an effective technique for accelerating latency-sensitive CNN inference tasks on resource-constrained accelerators that exploit distributed on-chip integrated memory-accelerator processing-in memory (PIM). However, previous research primarily focused on optimizing memory access, neglecting the significant impact of hardware architecture on latency. This study presents an analytical latency model for a 2D systolic array accelerator, taking into account various hardware factors such as array dimensions, buffer size, and bandwidth. We then investigate the influence of hardware architecture and fusion strategies, including weight and overlap reuse, on performance; these aspects are insufficiently addressed in existing access-based fusion models. By incorporating layer fusion with our proposed latency model across different architectures, dataflows, and workloads, we achieve up to a 53.1% reduction in end-to-end network latency compared to an access-based model.