FlooNoC：一个645gb /s/link 0.15-pJ/B/hop的开源NoC，具有宽物理链路和端到端并行多流支持

IF 2.8 2区工程技术 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-01-22 DOI:10.1109/TVLSI.2025.3527225

Tim Fischer;Michael Rogenmoser;Thomas Benz;Frank K. Gürkaynak;Luca Benini

{"title":"FlooNoC：一个645gb /s/link 0.15-pJ/B/hop的开源NoC，具有宽物理链路和端到端并行多流支持","authors":"Tim Fischer;Michael Rogenmoser;Thomas Benz;Frank K. Gürkaynak;Luca Benini","doi":"10.1109/TVLSI.2025.3527225","DOIUrl":null,"url":null,"abstract":"The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this article, we address this critical need by introducing the FlooNoC network-on-chip (NoC), featuring very wide, fully advanced extensible interface (AXI4) compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, nonblocking transactions are supported for latency tolerance. In addition, a novel end-to-end ordering approach for AXI4, enabled by a multistream capable direct memory access (DMA) engine, simplifies network interfaces (NIs) and eliminates interstream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12-nm FinFET technology demonstrates the physical feasibility and power performance area (PPA) benefits of our approach. Using wide links on high levels of metal, we achieve a bandwidth of 645 Gb/s/link and a total aggregate bandwidth of 103 Tb/s for an <inline-formula> <tex-math>$8\\times 4$ </tex-math></inline-formula> mesh of processors’ cluster tiles, with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared with state-of-the-art (SoA) NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared with a traditional AXI4-based multilayer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in GFLOPSDP within the same floorplan.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"1094-1107"},"PeriodicalIF":2.8000,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FlooNoC: A 645-Gb/s/link 0.15-pJ/B/hop Open-Source NoC With Wide Physical Links and End-to-End AXI4 Parallel Multistream Support\",\"authors\":\"Tim Fischer;Michael Rogenmoser;Thomas Benz;Frank K. Gürkaynak;Luca Benini\",\"doi\":\"10.1109/TVLSI.2025.3527225\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this article, we address this critical need by introducing the FlooNoC network-on-chip (NoC), featuring very wide, fully advanced extensible interface (AXI4) compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, nonblocking transactions are supported for latency tolerance. In addition, a novel end-to-end ordering approach for AXI4, enabled by a multistream capable direct memory access (DMA) engine, simplifies network interfaces (NIs) and eliminates interstream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12-nm FinFET technology demonstrates the physical feasibility and power performance area (PPA) benefits of our approach. Using wide links on high levels of metal, we achieve a bandwidth of 645 Gb/s/link and a total aggregate bandwidth of 103 Tb/s for an <inline-formula> <tex-math>$8\\\\times 4$ </tex-math></inline-formula> mesh of processors’ cluster tiles, with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared with state-of-the-art (SoA) NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared with a traditional AXI4-based multilayer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in GFLOPSDP within the same floorplan.\",\"PeriodicalId\":13425,\"journal\":{\"name\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"volume\":\"33 4\",\"pages\":\"1094-1107\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-01-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10848526/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10848526/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

新一代特定领域人工智能加速器的特点是对批量数据传输的需求快速增长，而传统缓存相干系统中典型的是对延迟要求极高的小型缓存线传输。在本文中，我们介绍了 FlooNoC 片上网络 (NoC)，以满足这一关键需求。FlooNoC 具有非常宽的、完全符合高级可扩展接口 (AXI4) 的链路，旨在以高能效满足海量带宽需求。在传输层，支持非阻塞事务以实现延迟容差。此外，多数据流直接内存访问（DMA）引擎支持 AXI4 的新型端到端排序方法，简化了网络接口（NI），消除了流间依赖性。此外，专门的物理链路被实例化，用于短小、延迟关键的信息。采用 12 纳米 FinFET 技术的完整端到端参考实现证明了我们方法的物理可行性和功耗性能面积 (PPA) 优势。通过在高水平金属上使用宽链路，我们实现了 645 Gb/s/链路的带宽和 103 Tb/s 的总带宽，适用于处理器集群瓦片的 $8/times 4$ 网格，总共有 288 个 RISC-V 内核。该 NoC 对每个计算单元的面积开销极小，仅为 3.5%，并在 0.8 V 电压下实现了 0.15 pJ/B/hop 的领先能效。与最先进的 SoA NoC 相比，我们的系统能效提高了三倍，链路带宽增加了一倍多。此外，与传统的基于 AXI4 的多层互连相比，我们的 NoC 减少了 30% 的面积，相当于在相同的平面图内增加了 47% 的 GFLOPSDP。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

FlooNoC: A 645-Gb/s/link 0.15-pJ/B/hop Open-Source NoC With Wide Physical Links and End-to-End AXI4 Parallel Multistream Support

The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this article, we address this critical need by introducing the FlooNoC network-on-chip (NoC), featuring very wide, fully advanced extensible interface (AXI4) compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, nonblocking transactions are supported for latency tolerance. In addition, a novel end-to-end ordering approach for AXI4, enabled by a multistream capable direct memory access (DMA) engine, simplifies network interfaces (NIs) and eliminates interstream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12-nm FinFET technology demonstrates the physical feasibility and power performance area (PPA) benefits of our approach. Using wide links on high levels of metal, we achieve a bandwidth of 645 Gb/s/link and a total aggregate bandwidth of 103 Tb/s for an

$8\times 4$

mesh of processors’ cluster tiles, with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared with state-of-the-art (SoA) NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared with a traditional AXI4-based multilayer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in GFLOPSDP within the same floorplan.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Very Large Scale Integration (VLSI) Systems 工程技术-工程：电子与电气

CiteScore

6.40

自引率

7.10%

发文量

187

审稿时长

3.6 months

期刊介绍： The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.