Raccoon: Lightweight Support for Comprehensive Control Flows in Reconfigurable Spatial Architectures

Xiangyu Kong; Yi Huang; Longlong Chen; Jianfeng Zhu; Liangwei Li; Xingchen Man; Mingyu Gao; Shaojun Wei; Leibo Liu

IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 6, pp. 1294-1310
DOI: 10.1109/TPDS.2025.3561145
Published: 2025-04-15
URL: https://ieeexplore.ieee.org/document/10965538/
JCR: Q1 (Computer Science, Theory & Methods)
Citations: 0
Abstract
Coarse-grained reconfigurable arrays (CGRAs) have emerged as promising candidates for digital signal processing, biomedical, and automotive applications, where energy efficiency and flexibility are paramount. Yet existing CGRAs suffer from the Amdahl bottleneck caused by constrained control handling via either off-device communication or expensive tag-matching mechanisms. More importantly, mapping control flow onto CGRAs is extremely arduous and time-consuming due to intricate instruction structures and hardware mechanisms. To address these limitations, we propose Raccoon, a portable and lightweight framework for CGRAs targeting a wide range of control flows. Raccoon comprises a comprehensive approach that spans microarchitecture, HW/SW interface, and compiler aspects. Regarding microarchitecture, Raccoon incorporates specialized infrastructure for branch- and loop-level control patterns with concise execution mechanisms. The HW/SW interface of Raccoon includes well-characterized abstractions and instruction sets tailored for easy compilation, featuring custom operators and architectural models for control-oriented units. On the compiler front, Raccoon integrates advanced control handling techniques and employs a portable mapper leveraging reinforcement learning and Monte Carlo tree search. This enables agile mapping and optimization of the entire program, ensuring efficient execution and high-quality results. Through this cohesive co-design, Raccoon can empower various CGRAs with robust control-flow handling capabilities, surpassing conventional tagged mechanisms in terms of hardware efficiency and compiler adaptability. Evaluation results show that Raccoon achieves up to a 5.78× improvement in energy efficiency and a 2.24× reduction in cycle count over state-of-the-art CGRAs. Raccoon stands out for its versatility in managing intricate control flows and showcases remarkable portability across diverse CGRA architectures.
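The abstract mentions a portable mapper that leverages Monte Carlo tree search (among other techniques) to place a program onto the CGRA fabric. The paper's actual mapper, cost model, and search formulation are not reproduced here; as a rough, self-contained illustration of the MCTS half of that idea only, the sketch below uses UCT selection to place a 4-operation chain onto a toy 2×2 PE grid while minimising routing distance. Every name, the grid size, and the wire-length cost model are invented for this example.

```python
import math
import random

# Toy CGRA mapping: place a chain of operations op0 -> op1 -> ... -> op3
# onto a 2x2 PE grid, minimising the total Manhattan routing distance
# between producer and consumer PEs. (Illustrative only; not Raccoon's
# actual mapper or cost model.)
PES = [(x, y) for x in range(2) for y in range(2)]
N_OPS = 4

def cost(placement):
    """Total wire length of a complete placement (lower is better)."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
               for a, b in zip(placement, placement[1:]))

def rollout(placement):
    """Finish a partial placement with random legal choices."""
    free = [p for p in PES if p not in placement]
    random.shuffle(free)
    return placement + free[: N_OPS - len(placement)]

def mcts(iterations=2000, c=1.4):
    # stats[state] = [visits, total_reward]; a state is a tuple of placed PEs
    stats = {(): [0, 0.0]}
    for _ in range(iterations):
        state, path = (), [()]
        # Selection: descend via UCT while all children are already expanded
        while len(state) < N_OPS:
            children = [state + (p,) for p in PES if p not in state]
            unexplored = [ch for ch in children if ch not in stats]
            if unexplored:
                state = random.choice(unexplored)  # expansion
                stats[state] = [0, 0.0]
                path.append(state)
                break
            state = max(children, key=lambda ch:
                        stats[ch][1] / stats[ch][0]
                        + c * math.sqrt(math.log(stats[state][0]) / stats[ch][0]))
            path.append(state)
        # Simulation: random completion; reward = negated routing cost
        reward = -cost(rollout(list(state)))
        # Backpropagation along the selected path
        for s in path:
            stats[s][0] += 1
            stats[s][1] += reward
    # Return the best fully placed state found by the search
    best = max((s for s in stats if len(s) == N_OPS),
               key=lambda s: stats[s][1] / stats[s][0])
    return list(best), cost(list(best))

random.seed(0)
placement, wire_len = mcts()
print(placement, wire_len)
```

With a 2×2 grid and 4 operations the search space has only 24 complete placements, so the search reliably finds an optimal snake-shaped placement of total wire length 3; a real mapper faces a vastly larger space plus routing, timing, and control-pattern constraints, which is why the paper pairs the search with learned guidance.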
Journal Description
IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:
a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing.
b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.
c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.
d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.