A Highly-Scalable Deep-Learning Accelerator With a Cost-Effective Chip-to-Chip Adapter and a C2C-Communication-Aware Scheduler

IF 3.8 | Zone 2, Engineering and Technology | Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems Pub Date : 2024-07-01 DOI:10.1109/JETCAS.2024.3421553

Jicheon Kim;Chunmyung Park;Eunjae Hyun;Xuan Truong Nguyen;Hyuk-Jae Lee

Volume 14, Issue 3, pp. 455-468. Full text: https://ieeexplore.ieee.org/document/10579814/

Citations: 0

Abstract


A Highly-Scalable Deep-Learning Accelerator With a Cost-Effective Chip-to-Chip Adapter and a C2C-Communication-Aware Scheduler

Multi-chip-module (MCM) technology heralds a new era for scalable DNN inference systems, offering a cost-effective alternative to large-scale monolithic designs by lowering fabrication and design costs. Nevertheless, MCMs often incur resource and performance overheads due to inter-chip communication, which largely reduce a performance gain in a scaling-out system. To address these challenges, this paper introduces a highly-scalable DNN accelerator with a lightweight chip-to-chip adapter (C2CA) and a C2C-communication-aware scheduler. Our design employs a C2CA for inter-chip communication, which accurately illustrates an MCM system with a constrained C2C bandwidth, e.g., about 1/16, 1/8, or 1/4 of an on-chip bandwidth. We empirically reveal that the limited C2C bandwidth largely affects the overall performance gain of an MCM system. For example, compared with the one-core engine, a four-chip MCM system with a constrained C2C bandwidth only achieves $2.60\times$, $3.27\times$, $2.84\times$, and $2.74\times$ performance gains on ResNet50, DarkNet19, MobileNetV1, and EfficientNetS, respectively. Mitigating the problem, we propose a novel C2C-communication-aware scheduler with forward and backward inter-layer scheduling. Specifically, our scheduler effectively utilizes a C2C bandwidth while a core is performing its own computation. To demonstrate the effectiveness and practicality of our concept, we modeled our design with Verilog HDL and implemented it on an FPGA board, i.e., Xilinx ZCU104. The experimental results demonstrate that the system shows significant throughput improvements compared to a single-chip configuration, yielding average enhancements of $1.87\times$ and $3.43\times$ for two-chip and four-chip configurations, respectively, on ResNet50, DarkNet19, MobileNetV1, and EfficientNetS.
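To put the reported figures in perspective, the short Python sketch below computes the mean of the four constrained-bandwidth gains and the scaling efficiency (speedup divided by chip count) implied by the reported averages. The speedup numbers are taken from the abstract. The function names and the `layer_time` toy model are illustrative assumptions, not the paper's scheduler or performance model, and the abstract does not state which C2C bandwidth setting each number was measured at.

```python
from statistics import mean

# Four-chip speedups over the one-core engine with constrained C2C
# bandwidth, as reported in the abstract (before the proposed scheduler).
constrained_4chip = {
    "ResNet50": 2.60,
    "DarkNet19": 3.27,
    "MobileNetV1": 2.84,
    "EfficientNetS": 2.74,
}

# Average speedups reported for the full system, keyed by chip count.
reported_avg = {2: 1.87, 4: 3.43}


def scaling_efficiency(speedup: float, chips: int) -> float:
    """Fraction of ideal linear speedup achieved (1.0 means perfect scaling)."""
    return speedup / chips


def layer_time(compute: float, comm: float, overlap: bool) -> float:
    """Toy per-layer latency model (an assumption, not the paper's model).

    Without overlap, a C2C transfer is serialized after the computation.
    With ideal overlap, the transfer hides behind the computation and only
    the longer of the two contributes to latency.
    """
    return max(compute, comm) if overlap else compute + comm


baseline_avg = mean(constrained_4chip.values())
print(f"mean constrained 4-chip speedup: {baseline_avg:.2f}x")
for chips, speedup in reported_avg.items():
    eff = scaling_efficiency(speedup, chips)
    print(f"{chips} chips: {speedup:.2f}x, scaling efficiency {eff:.1%}")
```

The `layer_time` helper only illustrates why using the C2C link while a core is computing can help: in this idealized model, a transfer shorter than the computation adds no latency when the two overlap.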


Source journal

IEEE Journal on Emerging and Selected Topics in Circuits and Systems (ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 8.50

Self-citation rate: 2.20%


Journal description: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.