Which Coupled is Best Coupled? An Exploration of AIMC Tile Interfaces and Load Balancing for CNNs

IF 6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-08-02 DOI:10.1109/TPDS.2024.3437657

Joshua Klein;Irem Boybat;Giovanni Ansaloni;Marina Zapater;David Atienza

{"title":"Which Coupled is Best Coupled? An Exploration of AIMC Tile Interfaces and Load Balancing for CNNs","authors":"Joshua Klein;Irem Boybat;Giovanni Ansaloni;Marina Zapater;David Atienza","doi":"10.1109/TPDS.2024.3437657","DOIUrl":null,"url":null,"abstract":"Due to stringent energy and performance constraints, edge AI computing often employs heterogeneous systems that utilize both general-purpose CPUs and accelerators. Analog in-memory computing (AIMC) is a well-known AI inference solution that overcomes computational bottlenecks by performing matrix-vector multiplication operations (MVMs) in constant time. However, the tiles of AIMC-based accelerators are limited by the number of weights they can hold. State-of-the-art research often sizes neural networks to AIMC tiles (or vice-versa), but does not consider cases where AIMC tiles cannot cover the whole network due to lack of tile resources or the network size. In this work, we study the trade-offs of available AIMC tile resources, neural network coverage, AIMC tile proximity to compute resources, and multi-core load balancing techniques. We first perform a study of single-layer performance and energy scalability of AIMC tiles in the two most typical AIMC acceleration targets: dense/fully-connected layers and convolutional layers. This study guides the methodology with which we approach parameter allocation to AIMC tiles in the context of large edge neural networks, both where AIMC tiles are close to the CPU (tightly-coupled) and cannot share resources across the system, and where AIMC tiles are far from the CPU (loosely-coupled) and can employ workload stealing. We explore the performance and energy trends of six modern CNNs using different methods of load balancing for differently-coupled system configurations with variable AIMC tile resources. We show that, by properly distributing workloads, AIMC acceleration can be made highly effective even on under-provisioned systems. As an example, 5.9x speedup and 5.6x energy gains were measured on an 8-core system, for a 41% coverage of neural network parameters.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1780-1795"},"PeriodicalIF":6.0000,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10622095/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Due to stringent energy and performance constraints, edge AI computing often employs heterogeneous systems that utilize both general-purpose CPUs and accelerators. Analog in-memory computing (AIMC) is a well-known AI inference solution that overcomes computational bottlenecks by performing matrix-vector multiplication operations (MVMs) in constant time. However, the tiles of AIMC-based accelerators are limited by the number of weights they can hold. State-of-the-art research often sizes neural networks to AIMC tiles (or vice-versa), but does not consider cases where AIMC tiles cannot cover the whole network due to lack of tile resources or the network size. In this work, we study the trade-offs of available AIMC tile resources, neural network coverage, AIMC tile proximity to compute resources, and multi-core load balancing techniques. We first perform a study of single-layer performance and energy scalability of AIMC tiles in the two most typical AIMC acceleration targets: dense/fully-connected layers and convolutional layers. This study guides the methodology with which we approach parameter allocation to AIMC tiles in the context of large edge neural networks, both where AIMC tiles are close to the CPU (tightly-coupled) and cannot share resources across the system, and where AIMC tiles are far from the CPU (loosely-coupled) and can employ workload stealing. We explore the performance and energy trends of six modern CNNs using different methods of load balancing for differently-coupled system configurations with variable AIMC tile resources. We show that, by properly distributing workloads, AIMC acceleration can be made highly effective even on under-provisioned systems. As an example, 5.9x speedup and 5.6x energy gains were measured on an 8-core system, for a 41% coverage of neural network parameters.

查看原文本刊更多论文

哪种耦合是最佳耦合？AIMC 瓦片接口和 CNN 负载平衡探索

由于严格的能耗和性能限制，边缘人工智能计算通常采用同时使用通用 CPU 和加速器的异构系统。模拟内存计算（AIMC）是一种著名的人工智能推理解决方案，它通过在恒定时间内执行矩阵-向量乘法运算（MVM）来克服计算瓶颈。然而，基于 AIMC 的加速器所能容纳的权重数量有限。最先进的研究通常会将神经网络的大小调整为 AIMC 瓦片（反之亦然），但不会考虑 AIMC 瓦片因缺乏瓦片资源或网络大小而无法覆盖整个网络的情况。在这项工作中，我们研究了可用 AIMC 瓦片资源、神经网络覆盖率、AIMC 瓦片与计算资源的接近程度以及多核负载平衡技术之间的权衡。我们首先研究了 AIMC 瓦片在两个最典型的 AIMC 加速目标中的单层性能和能量可扩展性：密集/全连接层和卷积层。这项研究为我们在大型边缘神经网络中处理 AIMC 瓦片参数分配提供了方法论指导，在这种情况下，AIMC 瓦片靠近 CPU（紧密耦合），无法在整个系统中共享资源，而在 AIMC 瓦片远离 CPU（松散耦合）的情况下，则可以采用工作负载窃取。我们探索了六种现代 CNN 的性能和能耗趋势，这些 CNN 采用了不同的负载均衡方法，适用于 AIMC 瓦片资源可变的不同耦合系统配置。我们的研究表明，通过适当分配工作负载，即使在配置不足的系统中，AIMC 加速也能非常有效。例如，在神经网络参数覆盖率为 41% 的 8 核系统上，我们测得了 5.9 倍的速度提升和 5.6 倍的能量增益。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Parallel and Distributed Systems 工程技术-工程：电子与电气

CiteScore

11.00

自引率

9.40%

发文量

281

审稿时长

5.6 months

期刊介绍： IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to: a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing. b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems. c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation. d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.