{"title":"MCHEAS: Optimizing Large-Parameter NTT Over Multicluster In-Situ FHE Accelerating System","authors":"Zhenyu Guan;Yongqing Zhu;Luchang Lei;Hongyang Jia;Yi Chen;Bo Zhang;Changrui Ren;Jin Dong;Song Bian","doi":"10.1109/TCAD.2025.3555191","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555191","url":null,"abstract":"Fully Homomorphic encryption (FHE) enables high-level security but with a heavy computation workload, necessitating software-hardware co-design for aggressive acceleration. Recent works on specialized accelerators for HE evaluation have made significant progress in supporting lightweight RNS-CKKS applications, especially those with high-density in-memory computing techniques. To fulfill higher computational demands for more general applications, this article proposes multicluster HE accelerating system (MCHEAS), an accelerating system comprising multiple in-situ HE processing accelerators, each functioning as a cluster to perform large-parameter RNS-CKKS evaluation collaboratively. MCHEAS features optimization strategies including the synchronous, preemptive swap, square-diagonal, and odd-even index separation. Using these strategies to compile the computation and transmission of number theoretic transform (NTT) coefficients, the method optimizes the intercluster data swaps, a major bottleneck in NTT computations. Evaluations show that under 1 GHz, with different intercluster data transfer bandwidths, our approach accelerates NTT computations by 26.40% to 51.75%. MCHEAS also improves computing unit utilization by 10.30% to 33.97%, with a maximum peak utilization rate of up to 99.62%. MCHEAS achieves 17.63% to 34.67% speedups for HE operations involving NTT, and 15.12% to 30.62% speedups for demonstrated applications, while enhancing the computing units’ utilization by 5.18% to 21.87% during application execution. Furthermore, we compare MCHEAS with SOTA designs under a specific intercluster data transfer bandwidth, achieving up to <inline-formula> <tex-math>$81.45times $ </tex-math></inline-formula> their area efficiencies in applications.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3683-3696"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GRACE: An End-to-End Graph Processing Accelerator on FPGA With Graph Reordering Engine","authors":"Haishuang Fan;Rui Meng;Qichu Sun;Jingya Wu;Wenyan Lu;Xiaowei Li;Guihai Yan","doi":"10.1109/TCAD.2025.3555192","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555192","url":null,"abstract":"Graphs play an important role in various applications. With the rapid expansion of vertices in real life, existing large-scale graph processing frameworks on CPUs and GPUs encounter challenges in optimizing cache usage due to irregular memory access patterns. To address this, graph reordering has been proposed to improve the locality of the graph, but introduces significant overhead without delivering substantial end-to-end performance improvement. While there have been many FPGA-based accelerators for graph processing, achieving high throughput often requires complex graph prepossessing on CPUs. Therefore, implementing an efficient end-to-end graph processing system remains challenging. This article introduces GRACE, an end-to-end FPGA-based graph processing accelerator with a graph reordering engine and a pull-based vertex-centric programming model (PL-VCPM) Engine. First, GRACE employs a customized high-degree vertex cache (HDC) to improve memory access efficiency. Second, GRACE offloads the graph preprocessing to FPGA. We customize an efficient graph reordering engine to complete preprocessing. Third, GRACE adopts a graph pruning strategy to remove the activation and computation redundancy in graph processing. Finally, GRACE introduces a graph conflict board (GCB) to resolve data conflicts and a multiport cache to enhance parallel efficiency. Experimental results demonstrate that GRACE achieves <inline-formula> <tex-math>$7.1 times $ </tex-math></inline-formula> end-to-end performance speedup over CPU and <inline-formula> <tex-math>$1.8 times $ </tex-math></inline-formula> over GPU, as well as <inline-formula> <tex-math>$27.3 times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$8.7 times $ </tex-math></inline-formula> energy efficiency over CPU and GPU. Moreover, GRACE delivers up to <inline-formula> <tex-math>$34.9 times $ </tex-math></inline-formula> performance speedup compared to the state-of-the-art FPGA accelerator.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3816-3829"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPOS: A General and Precise Offloading Strategy for High Generality of DNN Acceleration by OCP and NDP Co-Optimizing","authors":"Zixu Li;Wang Wang;Manni Li;Jiayu Yang;Zijian Huang;Xin Zhong;Yinyin Lin;Chengchen Wang;Xiankui Xiong","doi":"10.1109/TCAD.2025.3555184","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3555184","url":null,"abstract":"The arithmetic intensity (ArI) of different DNNs can be opposite. This challenges the generality of single acceleration architectures, including both dedicated on-chip processing (OCP) and near-data processing (NDP). Neither architecture can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. It is relatively straightforward to think of combining the respective advantages of OCP and NDP. However, few publications have addressed their real-time co-optimization, primarily due to the lack of a quantifiable offloading method. Here, we propose GPOS, a general and precise offloading strategy that supports high generality of DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), DNN model, and interdie data movements (DMs). Three quantifiable indicators—ArI, execution cost (Ex-cost), and DM-cost—are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down offloading candidates in step 1 and step 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and NDP-input stationary DF are proposed to reduce Ex-cost and extend operator types supported by NDP. Next, for the first time, sparsity—one of the most popular methods for energy optimization that can alter data reuse or ArI—is quantitatively investigated for its impacts on offloading using GPOS. Our evaluations include representative DNNs, including GPT-2, Bert, RNN, CNN, and MLP. GPOS achieves the minimum energy and latency for each benchmark, with geometric mean speedups of 49.0% and 94.1%, and geometric mean energy savings of 45.8% and 89.2% over All-OCP and All-NDP, respectively. GPOS also reduces offloading analysis latency by a geometric mean of 92.7% compared to the evaluation that traverses each operator and its relative combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. However, for DNNs where all operators exhibit either very high or very low ArI, the number of offloaded operators remains unchanged, even after sparsity is applied.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3776-3789"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Fast Heterogeneous Virtual Prototypes: Increasing the Solver Efficiency in SystemC AMS","authors":"Alexandra K端ster;Rainer Dorsch;Christian Haubelt","doi":"10.1109/TCAD.2025.3554612","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3554612","url":null,"abstract":"The development of modern heterogeneous systems requires early integration of the various domains to improve and verify the design. Heterogeneous virtual prototypes are a key enabler to reach this goal. In order to efficiently support the development, their high simulation speed is of utmost importance. This article introduces measures to speed-up SystemC analog/mixed-signal (AMS) simulations which are commonly used to simulate the AMS part jointly with the digital prototype in SystemC. Two approaches to integrate variable-step ordinary differential equation solvers into the simulation semantics of SystemC AMS are presented. Both of them avoid global backtracking. One is well suited for feedback loops and the other is favorable for systems dynamically reacting onto events. Moreover, a timestep quantization is developed that overcomes the recurrent matrix inversion bottleneck of variable-step implicit solvers. A similar method is then used to increase the simulation speed of electrical linear network models with high switching activity. Various experiments from the context of smart sensors are presented which prove the effectiveness for enhancing the simulation speed.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3868-3881"},"PeriodicalIF":2.9,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PTPS: Precision-Aware Task Partitioning and Scheduling for SpMV on CPU-FPGA Heterogeneous Platforms","authors":"Jianhua Gao;Zhi Zhou;Xingze Huang;Juan Wang;Yizhuo Wang;Weixing Ji","doi":"10.1109/TCAD.2025.3554144","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3554144","url":null,"abstract":"The CPU-FPGA heterogeneous computing architecture is extensively employed in the embedded domain due to its low cost and power efficiency, with numerous sparse matrix-vector multiplication (SpMV) acceleration efforts already targeting this architecture. However, existing work rarely includes collaborative SpMV computations between CPU and FPGA, which limits the exploration of hybrid architectures that could potentially offer enhanced performance and flexibility. This article introduces an FPGA architecture design that supports multiprecision SpMV computations, including FP16, FP32, and FP64. Building on this, PTPS, a precision-aware SpMV task partitioning and dynamic scheduling algorithm tailored for the CPU-FPGA heterogeneous architecture, is proposed. The core idea of PTPS is lossless partitioning of sparse matrices across multiple precisions, prioritizing low-precision SpMV computations on the FPGA and high-precision computations on the CPU. PTPS not only leverages the strengths of CPU and FPGA for collaborative SpMV computations but also reduces data transmission overhead between them, thereby improving the overall computational efficiency. Experimental evaluation demonstrates that the proposed approach offers an average speedup of <inline-formula> <tex-math>$1.57times $ </tex-math></inline-formula> over the CPU-only approach and <inline-formula> <tex-math>$2.58times $ </tex-math></inline-formula> over the FPGA-only approach.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3804-3815"},"PeriodicalIF":2.9,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Make It Easy! Timing Leakage Analysis on Cryptographic Chips Based on Horizontal Leakage","authors":"Guangze Hong;An Wang;Congming Wei;Yaoling Ding;Shaofei Sun;Jingqi Zhang;Liehuang Zhu","doi":"10.1109/TCAD.2025.3553779","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3553779","url":null,"abstract":"Timing analysis presents a significant threat to cryptographic modules. However, traditional timing leakage analysis has notable limitations, especially when precise execution times cannot be obtained. In this article, we propose a novel timing leakage analysis method that leverages horizontal leakage in the power/electromagnetic channel by detecting the trace length of encryption processes under varying inputs. To demonstrate the effectiveness of our approach, we conducted systematic experimental evaluations across a range of cryptographic devices. In comparison to timing leakage analysis based on plaintext-ciphertext correlation, our method offers higher accuracy at lower testing costs and exhibits improved resistance to vertical noise.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"4048-4052"},"PeriodicalIF":2.9,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145100325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving I/O Performance and Fairness in NVMe SSDs With Pooling Portions of Cache Partitions","authors":"Jiaojiao Wu;Li Cai;Zhigang Cai;Fengxiang Zhang;Jianwei Liao","doi":"10.1109/TCAD.2025.3553778","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3553778","url":null,"abstract":"Nonvolatile memory express (NVMe) solid-state drives (SSDs) have become mainstream storage devices in today’s computing systems, due to their high throughput and ultralow latency. It has been observed that the impact of interference among all concurrently running streams (i.e., I/O workloads) on their overall responsiveness differs significantly in multistream SSDs, resulting in unfairness. This article proposes a cache division management scheme built on top of the evenly partition scheme for NVMe SSDs, to enhance I/O responsiveness without consciously sacrificing fairness. To this end, we first build a mathematical model to directly cut portions from the Local cache partitions allocated to concurrently running streams, considering their run-time performance measures. Then, our approach pools these portions together for the use of all streams. As a result, each stream has its corresponding Local cache space for ensuring fairness, meanwhile the pooled Global cache space is shared by all streams for enhancing I/O responsiveness. Trace-driven simulation experiments demonstrate that our proposal reduces the overall I/O latency by up to <monospace>24.4</monospace>%, and improve the measure of fairness by <inline-formula> <tex-math>$mathtt{2.5}times $ </tex-math></inline-formula> on average, in contrast to existing cache management schemes for NVMe SSDs.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3710-3723"},"PeriodicalIF":2.9,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145090060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems society information","authors":"","doi":"10.1109/TCAD.2025.3566792","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3566792","url":null,"abstract":"","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"C2-C2"},"PeriodicalIF":2.7,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11007765","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems society information","authors":"","doi":"10.1109/TCAD.2025.3547448","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3547448","url":null,"abstract":"","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 4","pages":"C2-C2"},"PeriodicalIF":2.7,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10934962","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143667555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SaaP: Rearchitect SoC-as-a-Processor to Orchestrate Hardware Heterogeneity","authors":"Pengwei Jin;Zhe Fan;Yongwei Zhao;Zidong Du;Hongrui Guo;Ziyuan Nan;Yifan Hao;Chongxiao Li;Tianyun Ma;Zhenxing Zhang;Xiaqing Li;Wei Li;Xing Hu;Qi Guo;Zhiwei Xu;Tianshi Chen","doi":"10.1109/TCAD.2025.3553074","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3553074","url":null,"abstract":"Due to the end of Moore’s Law and Dennard Scaling, Domain-Specific Accelerators (DSAs) have come to a Cambrian explosion. Especially when advancing into the intelligent era, more and more DSAs are integrated into System-on-Chips (SoCs) as intellectual property (IP) blocks to provide high performance and efficiency. Currently, IPs usually expose IP-dependent hardware interfaces, requiring SoCs to manage them as isolated devices with software running on the host CPU. However, such software-managed heterogeneity in CPU-centric SoCs leads to low IP utilization. This inefficiency arises from the dependence on software optimization, coupled with the control and data exchange overheads. To improve IP utilization of heterogeneous SoCs, in this article, we rearchitect the SoC as a processor (i.e., SaaP) to orchestrate hardware heterogeneity. SaaP features an orchestration pipeline where DSAs are integrated as execution units and managed directly by the hardware pipeline to conceal the hardware heterogeneity from software. Moreover, SaaP redesigns the register file and data paths to implement an IP-level data-forwarding mechanism, avoiding the costly control and data exchange in the CPU-centric execution model. Block data dependence among different DSAs is carefully resolved to exploit mixed-level parallelism and inter-IP data exchange. SaaP abstracts tasks as mixed-scale instructions, where each instruction can be mapped to different IPs. Experimental results show that compared against Xavier on six fully software-optimized benchmarks from different domains, SaaP-rearchitected Xavier achieves a <inline-formula> <tex-math>$2.08{times }$ </tex-math></inline-formula> speedup, with an 8.21% area reduction and only 2.98% increase in power consumption.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3962-3975"},"PeriodicalIF":2.9,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145100426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}