{"title":"Optimized Modular Adder Architecture for Cryptographic Applications on FPGAs","authors":"Madani Bachir;Azzaz Mohamed Salah;Sadoudi Said;Kaibou Redouane;Bruno da Silva","doi":"10.1109/TCAD.2024.3518412","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3518412","url":null,"abstract":"Modular addition is a fundamental operation in public-key cryptographic algorithms operating in finite fields, such as elliptic curve cryptography (ECC), Chebyshev polynomials, and post-quantum cryptography (PQC). The performance of these cryptographic algorithms is limited by the conventional modular adder approach, which incorporates two cascaded adders in series. This approach doubles the critical path delay, ultimately decreasing the operating frequency despite utilizing a high-performance adder. This research presents a high-performance, low-area modular adder architecture based on a novel approach. Designed for the various prime fields recommended in public-key cryptography, the architecture optimally utilizes the carry chain and exploits the structural advantages of 7-series field-programmable gate arrays and later series. Implementation results demonstrate superior performance, achieving operating frequencies of 290.0 MHz for 192 bits and 205.5 MHz for 1024 bits. Notably, the proposed design performs modular addition in a single clock cycle, resulting in an approximately 57% frequency enhancement compared to the conventional approach. Consequently, this architecture stands as an optimal solution for systems demanding high-speed operations.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2168-2180"},"PeriodicalIF":2.7,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Less Traces Are All It Takes: Efficient Side-Channel Analysis on AES","authors":"Zhiyuan Xiao;Chen Wang;Jian Shen;Q. M. Jonathan Wu;Debiao He","doi":"10.1109/TCAD.2024.3518414","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3518414","url":null,"abstract":"In cryptography, side-channel analysis (SCA) is a technique used to recover cryptographic keys by examining the physical leakages that occur during the operation of cryptographic devices. Recent advancements in deep learning (DL) have greatly enhanced the extraction of crucial information from intricate leakage patterns. A considerable amount of research is dedicated to studying the SubByte (SB) operations of the advanced encryption standard (AES). This is because the SB process, which generates numerous transitions between 0s and 1s during encryption, results in significant energy leakage. However, traditional analysis models primarily focus on the initial round of SB operations in AES, making them less effective on mobile terminals where it is difficult to collect enough signals. These models often neglect additional operations and subsequent rounds, thus providing limited insights from small datasets. Consequently, this limitation has a direct impact on the accuracy and efficiency of key recovery. Our study uses <inline-formula> <tex-math>$\rho $ </tex-math></inline-formula>-test analysis to show that significant leakage occurs not only during the S-box operation but also during the AddRoundKey (AR) phase of AES. To address these challenges, we propose a new SCA method that is optimized for small sample sizes. This method includes a new comprehensive round trace labeling algorithm, which simultaneously analyzes the SB and AR stages of each AES round. Additionally, we introduce the peak precise localization algorithm to accurately identify the points of energy leakage during each encryption round. Our experiments, conducted with power and electromagnetic (EM) datasets from the STM32F303 microcontroller, demonstrate that our method can reliably recover keys with as few as 20 traces. These results highlight the enhanced capability of our method in handling the complexities of small sample datasets in cryptographic analysis.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2080-2092"},"PeriodicalIF":2.7,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Realization of Early-Exit Dynamic Neural Networks on Reconfigurable Hardware","authors":"Anastasios Dimitriou;Lei Xun;Jonathon Hare;Geoff V. Merrett","doi":"10.1109/TCAD.2024.3519055","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3519055","url":null,"abstract":"Early-exiting is a strategy that is becoming popular in deep neural networks (DNNs), as it can lead to faster execution and a reduction in the computational intensity of inference. To achieve this, intermediate classifiers abstract information from the input samples to strategically stop forward propagation and generate an output at an earlier stage. Confidence criteria are used to identify easier-to-recognize samples over the ones that need further filtering. However, such dynamic DNNs have only been realized in conventional computing systems (CPU+GPU) using libraries designed for static networks. In this article, we first explore the feasibility and benefits of realizing early-exit dynamic DNNs on field-programmable gate arrays (FPGAs), a platform already proven to be highly effective for neural network applications. We consider two approaches for implementing and executing the intermediate classifiers: 1) pipeline, which uses existing hardware and 2) parallel, which uses additional dedicated modules. We model their energy needs and execution time and explore their performance using the BranchyNet early-exit approach on LeNet-5, AlexNet, VGG19, and ResNet32, and a Xilinx ZCU106 Evaluation Board. We found that the dynamic approaches are at least 24% faster than a static network executed on an FPGA, while consuming at least <inline-formula> <tex-math>$1.32\times $ </tex-math></inline-formula> less energy. We further observe that FPGAs can enhance the performance of early-exit dynamic DNNs by minimizing the complexities introduced by the intermediate decision classifiers through parallel execution. Finally, we compare the two approaches and identify which is best for different network types and confidence levels.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2195-2203"},"PeriodicalIF":2.7,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Efficient Sparse Multiplications on GPUs With Heuristic Adaptability","authors":"Jiaming Xu;Shan Huang;Jinhao Li;Guyue Huang;Yuan Xie;Yu Wang;Guohao Dai","doi":"10.1109/TCAD.2024.3518413","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3518413","url":null,"abstract":"Sparse matrix-vector/matrix multiplication, namely SpMMul, has become a fundamental operation during model inference in various domains. Previous studies have explored numerous optimizations to accelerate it. However, to enable efficient end-to-end inference, the following challenges remain unsolved: 1) incomplete design space and time-consuming preprocessing. Previous methods optimize SpMMul in limited loops and neglect the potential space exploration for further optimization, resulting in >30% waste of computing power. In addition, the preprocessing overhead in SparseTIR and DTC-SpMM is <inline-formula> <tex-math>$1000\times $ </tex-math></inline-formula> larger than the sparse computation itself; 2) incompatibility between static dataflow and dynamic input. A static dataflow cannot be efficient for all inputs, leading to >80% performance loss; and 3) simplistic algorithm performance analysis. Previous studies primarily analyze performance from algorithmic advantages, without considering other aspects like hardware and data features. To tackle the above challenges, we present DA-SpMMul, a Data-Aware heuristic GPU implementation for SpMMul on multiple platforms. DA-SpMMul creatively proposes: 1) complete design space based on theoretical computations and nontrivial implementations without preprocessing. We propose three orthogonal design principles based on theoretical computations and provide nontrivial implementations on standard formats, eliminating the complex preprocessing; 2) feature-enabled adaptive algorithm selection mechanism. We design a heuristic model to enable algorithm selection considering various features; and 3) comprehensive algorithm performance analysis. We extract the features from multiple perspectives and present a comprehensive performance analysis of all algorithms. DA-SpMMul supports PyTorch on both NVIDIA and AMD and achieves an average speedup of <inline-formula> <tex-math>$3.33\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$3.02\times $ </tex-math></inline-formula> over NVIDIA cuSPARSE, and <inline-formula> <tex-math>$12.05\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$8.32\times $ </tex-math></inline-formula> over AMD rocSPARSE for sparse matrix-vector multiplication and sparse matrix-matrix multiplication, and up to <inline-formula> <tex-math>$1.48\times $ </tex-math></inline-formula> speedup against the state-of-the-art open-source algorithm. Integrated with the graph neural network framework PyG, DA-SpMMul achieves up to <inline-formula> <tex-math>$1.22\times $ </tex-math></inline-formula> speedup on GCN inference.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2226-2239"},"PeriodicalIF":2.7,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"B-HTRecognizer: Bitwise Hardware Trojan Localization Using Graph Attention Networks","authors":"Han Zhang;Zhenyu Fan;Yinhao Zhou;Ying Li","doi":"10.1109/TCAD.2024.3518417","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3518417","url":null,"abstract":"Hardware Trojans (HTs), which are malicious modifications injected into an integrated circuit (IC) by untrusted vendors, pose a significant threat to circuit design due to their highly destructive nature. The covert characteristics of HTs present challenges for detection methods, such as the requirement for transferable unknown circuit detection, the extensive manual effort involved, and the difficulty in fine-grained localization. To address these issues, we present B-HTRecognizer, a novel learning-based classification methodology that leverages HT similarities to automatically localize HTs in unknown designs at the bit level. In this study, we convert Verilog hardware description language (HDL) designs into bit-level edge-featured data flow graphs (DFGs) and use a graph attention network (GAT) for multidimensional feature extraction of HTs. The bit-level feature extraction can achieve better performance when dealing with Trigger-hidden HTs, which are highly likely to bypass existing GNN solutions. Furthermore, we construct an open-source HT dataset named TrustHub IMEex (publicly available at <uri>https://www.scidb.cn/en/anonymous/QjNFdmUy</uri>), which extends the TrustHub dataset to facilitate effective training and precise localization. Through rigorous experimentation across different designs, our proposed method achieves 84% precision and 93% recall in noncross-design settings, and a recall rate of 77% on a 32-bit RISC-V design in cross-design testing.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2240-2252"},"PeriodicalIF":2.7,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Technology Mapping With Adaptive Gate Decomposition","authors":"Chang Wu","doi":"10.1109/TCAD.2024.3515876","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3515876","url":null,"abstract":"FPGA technology mapping is an extensively studied problem. Existing approaches are based on either functional decomposition or graph covering. For efficiency reasons, most existing algorithms are graph covering-based. However, logic synthesis can affect the graph covering results significantly. In this article, we propose an FPGA mapping algorithm with gate decomposition. Bin-packing is used to generate gate decompositions during mapping, avoiding the decomposition choice pregeneration problem in existing approaches. Our results show that our algorithm achieves an average 13% area reduction over the state-of-the-art lossless synthesis-based mapping algorithm in ABC. Compared with the industrial tool Vivado, it achieves a significant area reduction of 35% on average.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2218-2225"},"PeriodicalIF":2.7,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Reduced State-Space Generation Method for Concurrent Systems Based on CPN-PR Model","authors":"Wenjie Zhong;Tao Sun;Jian-Tao Zhou;Zhuowei Wang;Xiaoyu Song","doi":"10.1109/TCAD.2024.3515857","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3515857","url":null,"abstract":"Colored Petri nets (CPNs) provide descriptions of the concurrent behaviors of software and hardware. Model checking based on CPNs is an effective method to simulate and verify concurrent behavior in system design. However, the model-checking method traverses the full state space, which suffers from the state-space explosion problem. A reduced state-space generation method related to the property of concurrent systems is proposed. Specifically, we extend CPNs to define a property-related model (CPN-PR) and give a property-related analysis method whose results can be used to generate the CPN-PR model. A reduced state-space generation method is developed based on enabled binding element filtering rules. The stutter trace equivalence between the state spaces of CPN and CPN-PR has been proven, showing that the reduced state space does not change the model-checking result. A comparison experiment is conducted to demonstrate the effectiveness of our method.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2328-2342"},"PeriodicalIF":2.7,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144100059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Performance RDMA NIC With Ultrahighly Scalable Connections","authors":"Ling Zhang;Xuefei Yang;Zhenlong Wan;Hang Liu;Wei Gu;Pingjing Liu;Qilin Dai;Shanwei Ye;Yingcheng Lin","doi":"10.1109/TCAD.2024.3514782","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3514782","url":null,"abstract":"Remote direct memory access (RDMA) technology has significantly enhanced network bandwidth and decreased transmission latency through kernel bypass and protocol offloading, overcoming obstacles in distributed computing systems. However, with the deployment of more intricate services in RDMA networks, current RDMA network interface cards (RNICs) have experienced a notable performance decline as the number of queue pair (QP) connections increases, substantially constraining the broad adoption of RDMA networks. To address this challenge, this article proposes a novel RNIC architecture with high connection scalability. This architecture incorporates a multitiered cache structure to handle diverse communication contexts, enabling the RNIC to support ultrahigh QP connection numbers while minimizing on-chip memory usage. In addition, the architecture facilitates chain prefetching, allowing on-chip caches to manage multiple concurrent requests, thus averting latency resulting from cache misses and access conflicts during communication under concurrent multiple-QP scenarios. This ensures transmission performance in multi-QP connection scenarios. This article implements and validates the performance of a 100G RNIC based on this architecture on Xilinx’s U280 FPGA. With approximately 1 M of on-chip memory usage for context, it can support 64 K performant QP connections (<inline-formula> <tex-math>$25\times $ </tex-math></inline-formula> more than CX-6) and can be extended if necessary. Experimental results confirm the high connection scalability of the RNIC, achieving approximately 92 Gb/s network throughput for data packet transmission with concurrent execution of 1–64 K QPs.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2156-2167"},"PeriodicalIF":2.7,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Efficient Hardware Accelerator for Learning-Based Image Compression","authors":"Chen Chen;Haoyang Zhang;Kaicheng Guo;Xingzi Yu;Weidong Qiu;Zhengwei Qi;Haibing Guan","doi":"10.1109/TCAD.2024.3515856","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3515856","url":null,"abstract":"Recently, learning-based image compression (LIC) methods have surpassed manually designed approaches in both compression quality and bitrate. However, increasing computational demands and insufficient optimizations in codec performance have hindered the advancement of LIC acceleration. Most research focuses on optimizing specific components, often neglecting the sources of underutilization during the execution of LIC models. Generally, efficient LIC acceleration encounters three primary challenges: 1) extra overheads introduced by individual optimizations; 2) load and computation imbalances in small kernels; and 3) mismatches between hardware configurations and the LIC models. To address these challenges, we propose a framework named extensive accelerator for LIC (X-LIC) for efficiently exploring the design space under constrained resources. First, we quantitatively characterize a representative LIC model, including its latency, computation size, and temporal utilization across various accelerators. We design a hardware-optimized quantization method to compensate for the lack of LIC-oriented research, particularly regarding data precision, distortion, and resource consumption. Additionally, we propose a parameterized LIC accelerator architecture that integrates seamlessly with existing loop optimization models and supports various LIC operators. Two optimization schemes are proposed for redundant computation in transposed convolution and for load and computation imbalance in small kernels. Experimental results show that our framework demonstrates significant flexibility across a broad design space, achieving an average of 78%–95% of the theoretical peak performance and up to 688.2/759.1 GOP/s en/de-coder performance with INT8 precision. As a result, the en/de-coder performance can reach up to 33/36 FPS at 720P resolution. An FPGA demo of X-LIC is available at <uri>https://github.com/sjtu-tcloud/X-LIC</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2204-2217"},"PeriodicalIF":2.7,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Efficient DNN Inferencing on ReRAM-Based PIM Accelerators Using Heterogeneous Operation Units","authors":"Gaurav Narang;Janardhan Rao Doppa;Partha Pratim Pande","doi":"10.1109/TCAD.2024.3514778","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3514778","url":null,"abstract":"Operation unit (OU)-based configurations enable the design of energy-efficient and reliable ReRAM crossbar-based processing-in-memory (PIM) architectures for deep neural network (DNN) inferencing. To exploit sparsity and tackle crossbar nonidealities, matrix-vector-multiplication (MVM) operations are computed at a much smaller level of granularity than a full crossbar, referred to as OUs. However, determining the suitable OU size for a given DNN workload presents a nontrivial challenge, as the DNN layers exhibit different levels of sparsity and have varying impact on overall predictive accuracy. In this article, we propose a framework for designing heterogeneous OU-based PIM accelerators. The OU configurations vary based on the characteristics of the neural layers and the time-dependent conductance drift of PIM devices due to repeated inference runs. Overall, our experimental results demonstrate that the sparsity-aware layer-wise heterogeneous OU-based PIM computation reduces latency and energy by 34% and 73% on average, respectively, compared to state-of-the-art homogeneous OU-based architectures without compromising the predictive accuracy.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2130-2143"},"PeriodicalIF":2.7,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}