CINOC: Computing in Network-On-Chip With Tiled Many-Core Architectures for Large-Scale General Matrix Multiplications

Authors: Yao Qin; Mingyu Wang; Jiahua Yan; Tao Lu; Zhiyi Yu
Journal: IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 72, no. 3, pp. 1256-1268
Publication date: 2024-10-01
DOI: 10.1109/TCSI.2024.3466217
URL: https://ieeexplore.ieee.org/document/10701550/
Citations: 0
Abstract
Large-scale general matrix multiplications (LMMs) are a key bottleneck in many computation domains, such as Transformer applications. However, performing LMMs efficiently on traditional multi-/many-core processor systems is challenging because of their heavy memory access and tight data-transmission dependences. To address these problems, we propose a computing-in-network-on-chip paradigm that performs LMMs while mitigating the performance losses caused by limited on-chip cache resources and memory bandwidth. Specifically, we propose a co-design of a computable network-on-chip and the last-level cache in tiled many-core architectures, which reconstructs redundant cache capacity as a computable input buffer to balance the computing, storage, and communication demands of running LMM applications. Furthermore, a data-aware thread execution mechanism is proposed to maximize the computational efficiency of thread streams in the computable network. At the software level, a memory-friendly matrix partitioning strategy, a hybrid routing method, and a programming model are designed to bridge the gap between application demands and mismatched hardware/software interfaces. Experimental evaluations demonstrate that the proposed work reduces computational latency by 45% compared with a state-of-the-art GPU architecture and improves inference performance on the GPT network by 2×.
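The abstract's memory-friendly matrix partitioning rests on the standard idea behind blocked (tiled) GEMM: partition a large multiplication into tiles small enough that each working set fits in a fixed-size on-chip buffer, so off-chip traffic per element drops. The sketch below is illustrative only and is not the paper's algorithm; the tile size `B_SZ` and function name are hypothetical choices for the example, not parameters taken from CINOC.

```python
# Illustrative sketch (not the paper's method): blocked matrix
# multiplication. Only one B_SZ x B_SZ tile of A, B, and C needs to be
# "live" at a time, which is why tiling suits limited on-chip buffers.

B_SZ = 2  # hypothetical tile edge length


def blocked_matmul(A, B, n):
    """Compute C = A @ B for n x n matrices given as lists of lists,
    iterating tile by tile over C and over the reduction dimension."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, B_SZ):          # tile row of C
        for j0 in range(0, n, B_SZ):      # tile column of C
            for k0 in range(0, n, B_SZ):  # reduction tile
                # Inner kernel on one tile pair; in a tiled many-core
                # design this is the unit of work a tile could run locally.
                for i in range(i0, min(i0 + B_SZ, n)):
                    for j in range(j0, min(j0 + B_SZ, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + B_SZ, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Each tile of C is reused across all reduction tiles before being written back, which is the communication-saving behavior the abstract attributes to balancing computing, storage, and transfer demands.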
Journal Introduction:
TCAS I publishes regular papers on the theory, analysis, design, and practical implementation of circuits, and on the application of circuit techniques to systems and to signal processing, spanning the whole spectrum from basic scientific theory to industrial applications. The field of interest includes:
- Circuits: analog, digital, and mixed-signal circuits and systems
- Nonlinear circuits and systems, integrated sensors, MEMS and systems on chip, nanoscale circuits and systems, optoelectronic circuits and systems
- Power electronics and systems
- Software for analog-and-logic circuits and systems
- Control aspects of circuits and systems