{"title":"LayoutCopilot: An LLM-Powered Multiagent Collaborative Framework for Interactive Analog Layout Design","authors":"Bingyang Liu;Haoyi Zhang;Xiaohan Gao;Zichen Kong;Xiyuan Tang;Yibo Lin;Runsheng Wang;Ru Huang","doi":"10.1109/TCAD.2025.3529805","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3529805","url":null,"abstract":"Analog layout design heavily involves interactive processes between humans and design tools. Electronic design automation (EDA) tools for this task are usually designed to use scripting commands or visualized buttons for manipulation, especially for interactive automation functionalities, which have a steep learning curve and a cumbersome user experience, creating a notable barrier to designers’ adoption. Aiming to address this usability issue, this article introduces LayoutCopilot, a pioneering multiagent collaborative framework powered by large language models (LLMs) for interactive analog layout design. LayoutCopilot simplifies human-tool interaction by converting natural language instructions into executable script commands, and it interprets high-level design intents into actionable suggestions, significantly streamlining the design process. Experimental results demonstrate the flexibility, efficiency, and accessibility of LayoutCopilot in handling real-world analog designs.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3126-3139"},"PeriodicalIF":2.7,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PBS: Program Behavior-Aware Scheduling for High-Level Synthesis","authors":"Aoxiang Qin;Rongjie Yang;Minghua Shen;Nong Xiao","doi":"10.1109/TCAD.2025.3529817","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3529817","url":null,"abstract":"Program behavior comprises operation dependencies and resource requirements, both of which impact scheduling performance in high-level synthesis (HLS). Most existing scheduling methods focus on only one aspect, resulting in poor performance. In this article, we propose PBS, a program behavior-aware scheduling method for HLS. We leverage a hybrid state encoding scheme to facilitate comprehensive learning of program behaviors. Moreover, we propose bi-directional graph neural network (GNN) and multiresolution aggregation schemes for learning complex operation dependency behavior. These schemes are integrated into a reinforcement learning (RL) framework to iteratively improve scheduling solutions toward low latency and low resource usage. Experiments show that PBS provides average latency reductions of 32.7%, 26.3%, and 25.9% compared with the SDC, GNN-based, and RL-based methods, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3006-3019"},"PeriodicalIF":2.7,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ChainPIM: A ReRAM-Based Processing-in-Memory Accelerator for HGNNs via Chain Structure","authors":"Wenjing Xiao;Jianyu Wang;Dan Chen;Chenglong Shi;Xin Ling;Min Chen;Thomas Wu","doi":"10.1109/TCAD.2025.3528906","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3528906","url":null,"abstract":"Heterogeneous graph neural networks (HGNNs) have recently demonstrated significant advantages in capturing the powerful structural and semantic information in heterogeneous graphs. Unlike homogeneous graph neural networks, which aggregate information directly from neighbors, HGNNs aggregate information along complex metapaths. ReRAM-based processing-in-memory (PIM) architectures can reduce data movement and perform matrix-vector multiplication (MVM) in the analog domain, making them well suited to accelerating HGNNs. However, the complex metapath-based aggregation of HGNNs makes it challenging to efficiently exploit the parallelism of ReRAM and reuse vertex data. To this end, we propose ChainPIM, the first ReRAM-based PIM accelerator for HGNNs featuring high computing parallelism and vertex data reuse. Specifically, we introduce the R-chain, a chain structure that groups related metapath instances together. Vertices can be efficiently reused within an R-chain, and different R-chains can be processed in parallel. We further design an efficient storage format for R-chains, which eliminates substantial redundant vertex storage. Finally, a specialized ReRAM-based architecture is developed to pipeline the different types of aggregations in HGNNs, fully exploiting the huge potential of multilevel parallelism in HGNNs. 
Our experiments show that ChainPIM achieves an average memory space reduction of 47.86% and a performance improvement of <inline-formula> <tex-math>$128.29\times $ </tex-math></inline-formula> compared to an NVIDIA Tesla V100 GPU.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2516-2529"},"PeriodicalIF":2.7,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency","authors":"Matteo Perotti;Samuel Riedel;Matheus Cavalcante;Luca Benini","doi":"10.1109/TCAD.2025.3528349","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3528349","url":null,"abstract":"The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64 bit floating-point-capable vector processor based on RISC-V’s vector extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dual-core vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SCM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB in size, we show that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries’ 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 DP-GFLOPS/W at 1 GHz and nominal operating conditions (TT, 0.80 V, and 25 °C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 DP-GFLOPS, and 99.3 DP-GFLOPS/W (61% of the power spent in the FPUs) on a 2D workload with a <inline-formula> <tex-math>$7\times 7$ </tex-math></inline-formula> kernel, resulting in an outstanding area/energy efficiency of 171 DP-GFLOPS/W/mm2. 
At equi-area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2488-2502"},"PeriodicalIF":2.7,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Magnifier: A Chiplet Feature-Aware Test Case Generation Method for Deep Learning Accelerators","authors":"Boyu Li;Zongwei Zhu;Weihong Liu;Qianyue Cao;Changlong Li;Cheng Ji;Xi Li;Xuehai Zhou","doi":"10.1109/TCAD.2025.3528358","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3528358","url":null,"abstract":"The development of deep learning has led to increasing demands for computation and memory, making multichiplet accelerators a powerful solution. Compared to monolithic designs, multichiplet accelerators require more precise consideration of hardware configurations and mapping schemes in terms of computation, memory, and communication patterns in order to avoid underutilizing performance. However, there is currently a lack of performance testing methods specifically tailored to multichiplet accelerators. Existing testing methods primarily focus on correctness and do not address potential performance issues from a hardware perspective. To address these issues, this article proposes Magnifier, a test case generation method for performance testing of multichiplet accelerators. First, we analyze a typical multichiplet accelerator prototype from the perspectives of computation, memory, and communication patterns, and summarize a chiplet feature-aware operator task set. Next, we define a test evaluation metric, the interdevice percentile performance standard deviation, and use a candidate operator set to construct a sampling space for model-level test cases. Finally, we build a generative adversarial network to learn the distribution of high-diversity test cases, enabling the rapid generation of high-quality test cases. We validate the proposed method on both simulated and real multichiplet accelerators. 
Experiments show that Magnifier improves the evaluation metric of generated test cases by up to 3.42 times and significantly reduces generation time, providing valuable insights for optimizing the hardware and software of multichiplet accelerators.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2803-2816"},"PeriodicalIF":2.7,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Superscalar Time-Triggered Versatile-Tensor Accelerator","authors":"Yosab Bebawy;Aniebiet Micheal Ezekiel;Roman Obermaisser","doi":"10.1109/TCAD.2025.3528355","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3528355","url":null,"abstract":"Integrating AI hardware accelerators into safety-critical real-time systems to speed up the inference execution of safety-critical AI applications demands rigorous assurance to prevent potentially catastrophic outcomes, especially in environments where timely and accurate results are crucial. Even when AI models are designed and constructed correctly using AI frameworks, the system’s safety also relies on the real-time behavior of the AI hardware accelerator. While AI hardware accelerators can achieve the necessary throughput, conventional accelerators, such as the versatile tensor accelerator (VTA), encounter significant challenges in predictability and reliability. These challenges stem from the variability of event-driven inference execution and insufficient timing control, posing considerable risks in safety-critical scenarios where delays in providing inference results can have severe consequences. To address this challenge, previous work introduced the time-triggered VTA (TT-VTA) to ensure timely execution of tensor operations. Nonetheless, the TT-VTA exhibited a slightly longer average inference time of 53 ms compared to the conventional VTA’s 51 ms, underscoring the ongoing need to speed up inference execution while sustaining the deterministic and predictable behavior of the TT-VTA. This article proposes a novel superscalar TT-VTA (STT-VTA) architecture specifically designed to address the deficiencies of conventional VTAs and TT-VTAs. 
The STT-VTA architecture employs pattern-based timing schedules generated by an extended software simulator and an architecture configuration manager to analyze the tensor operations within a given AI model and determine the number of additional VTA modules required for faster inference than a single (TT-)VTA setup. It integrates DRAMSim2 for memory instructions and a cycle-accurate simulator for nonmemory instructions. Evaluation using various models demonstrates that the STT-VTA achieves classification accuracy identical to the conventional VTA and TT-VTA, while improving performance and reducing inference time by 20%–41%. Moreover, it ensures deterministic temporal use of shared resources, such as memories and memory buses, and precise timing control to avoid interference. These results contribute toward the safety and reliability of AI systems deployed in safety-critical environments.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2503-2515"},"PeriodicalIF":2.7,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel CNN-Based Redundancy Analysis Using Parallel Solution Decision","authors":"Seung Ho Shin;Minho Cheong;Hayoung Lee;Byungsoo Kim;Sungho Kang","doi":"10.1109/TCAD.2025.3527905","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3527905","url":null,"abstract":"The increase in memory cell density and capacity has resulted in more faulty cells, necessitating the use of redundant memory row and column lines for repairs. However, existing redundancy analysis (RA) algorithms face a critical issue: RA time increases exponentially with the number of faulty cells. Furthermore, RA solutions for multiple memory chips cannot be derived simultaneously. In this study, a novel RA method is proposed using a convolutional neural network (CNN). The proposed RA algorithm also includes preprocessing to improve training accuracy. The solution locations on the fault map are predicted using multilabel classification. Moreover, parallel solution decision methods ensure that even if the CNN does not find the correct RA solution, an accurate final solution can still be derived, and PyCUDA is used to process multiple memories in parallel. From the experimental results, the normalized repair rate of the proposed RA is 100%. The RA time of the proposed method is determined not by the number of faults but by the CNN execution time. Moreover, RA solutions for multiple memories can be derived quickly and simultaneously by utilizing graphics processing unit (GPU) parallel processing. 
In conclusion, high yield and low test cost can be achieved.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2789-2802"},"PeriodicalIF":2.7,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Oxpecker: Leaking Secrets via Fetch Target Queue","authors":"Shan Li;Zheliang Xu;Haihua Shen;Huawei Li","doi":"10.1109/TCAD.2025.3527903","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3527903","url":null,"abstract":"Modern processors integrate carefully designed micro-architectural components within the front-end to optimize performance, including the instruction cache, micro-operation cache, and instruction prefetcher. Through experimentation, we observed that the rate of instruction generation in the fetch unit markedly exceeds the execution rate in the decode unit, a phenomenon that existing processor frameworks fail to explain. Consequently, we empirically validate the presence of an optimization feature, referred to as the fetch target queue (FTQ), within Intel processors. To the best of our knowledge, our study represents the first empirical validation of the FTQ across various Intel processors and provides a comprehensive characterization of undocumented FTQ microstructural details on these processors. Our analysis uncovers an overlooked insight: front-end rollbacks caused by incorrectly ordered instructions or mismatched instruction lengths stored in the FTQ introduce specific execution latencies. Based on these observations, we introduce the Oxpecker attack, consisting of two attack primitives that leverage the FTQ to construct novel side-channel attacks. 
We construct two distinct exploitation scenarios for each attack primitive to demonstrate the Oxpecker attack’s capability to leak secret control-flow information and break Kernel Address Space Layout Randomization (KASLR).","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2461-2474"},"PeriodicalIF":2.7,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lay-Net: Grafting Netlist Knowledge on Layout-Based Congestion Prediction","authors":"Lancheng Zou;Su Zheng;Peng Xu;Siting Liu;Bei Yu;Martin D. F. Wong","doi":"10.1109/TCAD.2025.3527379","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3527379","url":null,"abstract":"Congestion modeling is crucial for enhancing the routability of VLSI placement solutions, yet the underutilization of netlist information constrains the efficacy of existing layout-based congestion modeling techniques. We devise a novel approach that grafts netlist-based message passing (MP) onto a layout-based model, achieving a better fusion of layout and netlist knowledge to improve congestion prediction performance. The heterogeneous MP paradigm incorporates routing demand into the model more effectively by considering connections between cells, overlaps between nets, and interactions between cells and nets. Leveraging multiscale features, the proposed model captures connection information across various ranges, addressing the inadequate global information of existing models. Contrastive learning and mini-Gnet techniques allow the model to learn more effective feature representations, further boosting prediction performance. Extensive experiments demonstrate a notable performance enhancement of the proposed model compared to existing methods. 
Our code is available at: <uri>https://github.com/lanchengzou/congPred</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2627-2640"},"PeriodicalIF":2.7,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Precise and Explainable Hardware Trojan Localization at LUT Level","authors":"Hao Su;Wei Hu;Xuelin Zhang;Dan Zhu;Lingjuan Wu","doi":"10.1109/TCAD.2025.3527377","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3527377","url":null,"abstract":"Hardware Trojans represent a severe threat to hardware security and trust. This work investigates the Trojan detection problem from a unique viewpoint and proposes a novel hardware Trojan localization method targeting FPGA netlists. The proposed method automatically extracts rich structural and behavioral features at the look-up-table (LUT) level to train an explainable graph neural network (GNN) model that classifies design nodes in FPGA netlists and identifies the Trojan-infected ones. Experimental results on 183 hardware Trojan benchmarks show that our method successfully pinpoints Trojan-infected nodes with a true positive rate, accuracy, and area under the ROC curve (AUC) of 95.14%, 95.71%, and 95.46%, respectively. To the best of our knowledge, this is the first LUT-level Trojan localization solution using explainable GNNs.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2817-2821"},"PeriodicalIF":2.7,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}