Architecture-Aware Optimization of Layer Fusion for Latency-Optimal CNN Inference

2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Pub Date: 2023-06-11, DOI: 10.1109/AICAS57966.2023.10168659

Minyong Yoon, Jungwook Choi


Abstract

Layer fusion is an effective technique for accelerating latency-sensitive CNN inference tasks on resource-constrained accelerators that exploit distributed, on-chip, memory-integrated processing-in-memory (PIM). However, previous research has focused primarily on optimizing memory access and has neglected the significant impact of hardware architecture on latency. This study presents an analytical latency model for a 2D systolic array accelerator that accounts for hardware factors such as array dimensions, buffer size, and bandwidth. We then investigate how hardware architecture and fusion strategies, including weight reuse and overlap reuse, affect performance; existing access-based fusion models address these aspects insufficiently. By combining layer fusion with our proposed latency model across different architectures, dataflows, and workloads, we achieve up to a 53.1% reduction in end-to-end network latency compared to an access-based model.
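To make the contrast between access-based and latency-based fusion decisions concrete, the following is a minimal sketch, not the model from the paper. It assumes a roofline-style estimate in which a layer's latency is the larger of its compute cycles and its off-chip transfer cycles (perfect overlap), one byte per element, stride-1 convolutions with "same" padding, and a simple fold-based cycle count for the systolic array. All names (`Array`, `Conv`, `compute_cycles`, `layer_latency`, `pair_latency`) and the example numbers are hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Array:
    rows: int             # PE rows (reduction dimension is mapped here; assumption)
    cols: int             # PE columns (output channels are mapped here; assumption)
    buffer_bytes: int     # on-chip buffer capacity
    bytes_per_cycle: int  # off-chip bandwidth


@dataclass
class Conv:
    in_ch: int
    out_ch: int
    kernel: int  # square kernel side
    out_hw: int  # square output side (stride 1, "same" padding assumed)

    def weight_bytes(self) -> int:
        return self.kernel ** 2 * self.in_ch * self.out_ch

    def in_bytes(self) -> int:
        return self.out_hw ** 2 * self.in_ch

    def out_bytes(self) -> int:
        return self.out_hw ** 2 * self.out_ch


def compute_cycles(layer: Conv, hw: Array) -> int:
    """Fold the layer onto the array; each fold streams every output pixel."""
    folds = ceil(layer.kernel ** 2 * layer.in_ch / hw.rows) * ceil(layer.out_ch / hw.cols)
    per_fold = layer.out_hw ** 2 + hw.rows + hw.cols - 1  # streaming + fill/drain
    return folds * per_fold


def layer_latency(layer: Conv, hw: Array, read_input=True, write_output=True) -> int:
    """Latency = max(compute, off-chip transfer), assuming perfect overlap."""
    traffic = layer.weight_bytes()
    if read_input:
        traffic += layer.in_bytes()
    if write_output:
        traffic += layer.out_bytes()
    return max(compute_cycles(layer, hw), ceil(traffic / hw.bytes_per_cycle))


def pair_latency(a: Conv, b: Conv, hw: Array, fused: bool) -> int:
    """Fusing keeps a's output on chip, but only if it fits in the buffer."""
    if fused and a.out_bytes() > hw.buffer_bytes:
        fused = False  # intermediate feature map does not fit; fall back
    return (layer_latency(a, hw, write_output=not fused)
            + layer_latency(b, hw, read_input=not fused))


if __name__ == "__main__":
    a, b = Conv(16, 32, 3, 32), Conv(32, 32, 3, 32)
    for bw in (2, 4):
        hw = Array(rows=16, cols=16, buffer_bytes=64 * 1024, bytes_per_cycle=bw)
        print(bw, pair_latency(a, b, hw, fused=False), pair_latency(a, b, hw, fused=True))
```

In this sketch, fusion always removes off-chip traffic when the intermediate feature map fits in the buffer, but it shortens latency only when a layer was memory-bound to begin with. With the lower bandwidth setting the first layer is memory-bound and fusion helps; with the higher setting both layers are compute-bound and the fused and unfused latencies coincide, which is the kind of case an access-only cost model cannot distinguish.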
