Shuya Ji;Weidong Yang;Jianfei Jiang;Naifeng Jing;Honglan Jiang;Zhigang Mao;Qin Wang
Title: MACS: A Multidomain Collaborative Adaptive Clock Scheme for Large-Scale Reconfigurable Dataflow Accelerators
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 8, pp. 2992-3005
DOI: 10.1109/TCAD.2025.3533305
Published: 2025-01-23
URL: https://ieeexplore.ieee.org/document/10851296/
Abstract
To guarantee reliability and correctness, VLSI circuits are designed with conservative margins to maintain timing and power integrity against process, voltage, and temperature (PVT) variations across diverse workloads. However, worst-case PVT and workload conditions rarely occur in practice, resulting in significant timing slack and hence performance and energy loss, especially in reconfigurable dataflow accelerators (RDAs) due to their large scale and configurability. Previous studies have attempted to exploit workload or PVT slack, yet they achieve limited benefits for RDAs with large-scale processing element (PE) arrays. The key issues are restricted scaling ranges for the clock, insufficient representations of the workload, and unbalanced workloads within PEs. To address these challenges, this article proposes the first multidomain collaborative adaptive clock scheme (MACS) to efficiently exploit both workload and PVT timing slack for large-scale RDAs. MACS partitions the RDA into several clock domains and allows constrained clock domain crossing, which enhances hardware efficiency with minimal overhead and supports timing validation using conventional static timing analysis (STA) tools. In each domain, an operand-aware workload detection unit is developed, using both static configurations and dynamic operands to assess workload. The detected workload, combined with the monitored PVT conditions, determines the subsequent clock period. Additionally, to enable the exploration of timing slack over a broader range, the period range of the adaptive clock is extended. Experimental results show that MACS achieves a performance improvement of 76.3% or an energy saving of 36.6% with a hardware cost of 3.5%.
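The per-domain decision described in the abstract (detected workload plus monitored PVT condition selects the next clock period) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The workload classes, delay factors, period values, the multiplicative combination of workload and PVT factors, and all names (`DomainState`, `classify_workload`, `next_period_ns`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# All constants are hypothetical placeholders, not figures from the paper.
WORST_CASE_PERIOD_NS = 2.0  # period that would be signed off for worst-case PVT and workload
MIN_PERIOD_NS = 0.8         # lower bound of the (assumed) adaptive clock range

# Assumed fraction of the worst-case critical path exercised by each workload class.
WORKLOAD_DELAY_FACTOR = {"idle": 0.5, "light": 0.7, "heavy": 1.0}


@dataclass
class DomainState:
    opcode: str        # static configuration of the PEs in this clock domain
    operand_bits: int  # dynamic operand width observed in this domain
    pvt_factor: float  # monitored PVT delay relative to worst case, in (0, 1]


def classify_workload(opcode: str, operand_bits: int) -> str:
    """Assumed operand-aware rule: combine the static opcode with operand width."""
    if opcode == "nop":
        return "idle"
    if opcode == "mul" and operand_bits > 16:
        return "heavy"
    return "light"


def next_period_ns(state: DomainState) -> float:
    """Pick the next clock period for one domain, clamped to the adaptive range."""
    workload = classify_workload(state.opcode, state.operand_bits)
    # Assumption: workload and PVT slack combine multiplicatively.
    period = WORST_CASE_PERIOD_NS * WORKLOAD_DELAY_FACTOR[workload] * state.pvt_factor
    return max(period, MIN_PERIOD_NS)


# Each clock domain is evaluated independently.
domains = [
    DomainState("mul", 32, 1.0),
    DomainState("add", 8, 0.9),
    DomainState("nop", 0, 0.5),
]
periods = [next_period_ns(d) for d in domains]
```

In this sketch a domain running wide multiplications at worst-case PVT keeps the full worst-case period, while lighter domains run with shorter periods down to the assumed floor. A wider adaptive range corresponds to lowering `MIN_PERIOD_NS`.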
About the journal:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.