"A Built-In Self-Repair With Maximum Fault Collection and Fast Analysis Method for HBM"
Joonsik Yoon; Hayoung Lee; Youngki Moon; Seung Ho Shin; Sungho Kang
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 5, pp. 2014-2025. Published 2024-11-14. DOI: https://doi.org/10.1109/TCAD.2024.3499903
Abstract: High bandwidth memory (HBM) represents a significant advancement in memory technology, requiring quick and accurate data processing. Built-in self-repair (BISR) is crucial for ensuring high-capacity and reliable memories: it automatically detects and repairs faults within memory systems, preventing data loss and enhancing overall memory reliability. The proposed BISR aims to enhance the repair rate and reliability by using a content-addressable memory structure that operates effectively in both offline and online modes. Furthermore, a new redundancy analysis algorithm reduces both analysis time and area overhead by converting fault information into a matrix format and focusing on fault-free areas for each repair solution. Experimental results demonstrate that the proposed BISR improves repair rates and derives a final repair solution immediately after the test sequences are completed. Moreover, hardware comparisons show that the proposed approach reduces the area overhead as memory size increases. Consequently, the proposed design enhances overall BISR performance and the reliability of HBM.

{"title":"COCO: Configuration-Based Compaction of a Compressed Topped-Off Test Set","authors":"Irith Pomeranz","doi":"10.1109/TCAD.2024.3499907","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499907","url":null,"abstract":"Comprehensive defect coverage requires test sets that detect faults from several fault models. A test set is typically topped-off to detect faults from an additional fault model that are not already detected. This creates large test sets whose last tests detect small numbers of additional faults. Reducing the storage requirements of topped-off test sets (or test sets for fault models with large numbers of faults) is the topic of this article. Instead of storing the last tests in their entirety, it was shown previously that it is possible to produce the last tests of the test set from tests that appear earlier by complementing single bits. The storage requirements are reduced when only complemented bits are stored; however, the number of applied tests is increased. This article observes that changing the configuration by which decompressed test data are shifted into scan chains produces new tests that are effective in replacing tests at the end of a topped-off test set without increasing the number of applied tests. This approach is developed in this article in an academic environment and implemented using academic software tools. It is applied to benchmark circuits to demonstrate its effectiveness.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1991-1999"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143860757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Bridging the Gap From Vague Design Requirements to Feasible Structure: Deep Learning Model for Parameterized MEMS Sensor Design"
Xiong Cheng; Pengfei Zhang; Yiqi Zhou; Rui Wang; Zhixiang Zhai; Youyou Fan; Wenhua Gu; Xiaodong Huang; Daying Sun
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 5, pp. 1845-1855. Published 2024-11-14. DOI: https://doi.org/10.1109/TCAD.2024.3499897
Abstract: The design of MEMS sensors presents a significant challenge in identifying feasible structures that align with specific performance criteria. Traditionally, this process demands extensive design expertise and iterative simulations, leading to time-intensive workflows. While recent advancements have introduced deep learning (DL) models to expedite this process, they are limited to handling simple scenarios with precise performance values and fixed dimensions as inputs, often overlooking the uncertainty inherent in real design scenarios, such as vague range requirements and variable input dimensions. To address this issue, this study introduces a novel DL-based design model along with corresponding modeling strategies. The proposed model consists of a search network (SN), a validation network (VN), and a precision optimizer (PO). Initially, design requirements of various types and dimensions are transformed into a standardized input vector to address diverse design scenarios, which is then processed by the SN to generate a feasible structure. The VN, trained prior to the SN, validates the structure and generates training data for the SN. In cases where the model output fails to sufficiently align with the requirements, the PO is deployed to minimize the design error. Validation of the proposed model was conducted using a piezoresistive acceleration sensor across 100,000 distinct design requirements. The results demonstrate an overall design accuracy (DA) of 92.64% on the testing data. Following 1000 iterations leveraging the proposed PO, the DA improves to 93.84%. Notably, each design iteration and optimization using the PO only requires approximately 0.1 ms, significantly boosting the design efficiency of MEMS sensors.

{"title":"Noninvasive Methodology for the Age Estimation of ICs Using Gaussian Process Regression","authors":"Anmol Singh Narwariya;Pabitra Das;Saqib Khursheed;Amit Acharyya","doi":"10.1109/TCAD.2024.3499893","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499893","url":null,"abstract":"Age prediction for integrated circuits (ICs) is essential in establishing prevention and mitigation steps to avoid unexpected circuit failures in the field. Any electronic system would get benefit from an accurate age calculation. Additionally, it would assist in reducing the amount of electronic waste and the effort toward green computing. In this article, we propose a methodology to estimate the age of ICs using the Gaussian process regression (GPR). The output frequency of the ring oscillator (RO) is influenced by various factors, including the trackable path, voltage, temperature, and ageing. These dependencies are leveraged in the GPR model training. We demonstrate the RO’s frequency degradation by employing the Synopsys HSPICE tool with 32 nm predictive technology model (PTM) and the Synopsys technology library. We used temperature variation from 0 °C to 100 °C and voltage variation from 0.80 to 1.05 V for the data acquisition. Our methodology predicts age precisely; the minimum prediction accuracy with a month deviation on linear sampling rate is 85.36% for 13-Stage RO and 87.09% for 21-Stage RO, with a range of improvement in prediction accuracy compared to state-of-the-art (SOTA) is 9.74% to 16.99%. Similarly, on the logarithmic sampling rate, the prediction accuracy for 13-Stage RO and 21-Stage RO are 98.62% and 98.56%, respectively. The proposed methodology performs more accurately in terms of prediction accuracy and age prediction deviation from the SOTA methodology.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1833-1844"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SmartQCache: Fast and Precise Pulse Control With Near-Quantum Cache Design on FPGA","authors":"Liqiang Lu;Wuwei Tian;Xinghui Jia;Zixuan Song;Siwei Tan;Jianwei Yin","doi":"10.1109/TCAD.2024.3497839","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3497839","url":null,"abstract":"Quantum pulse serves as the machine language of superconducting quantum devices, which needs to be synthesized and calibrated for precise control of quantum operations. However, existing pulse control systems suffer from the dilemma between long synthesis latency and inaccuracy of quantum control systems. compute-in-CPU synthesis frameworks, like IBM Qiskit Pulse, involve massive redundant computation during pulse calculation, suffering from a high computational cost when handling large-scale circuits. On the other hand, field-programmable gate array (FPGA)-based synthesis frameworks, like QuMA, faces inaccurate pulse control problem. In this article, we propose both compute-in-CPU and all-in-FPGA solutions to collaboratively solve the latency and inaccuracy problem. First, we propose QPulseLib, a novel compute-in-CPU library with reusable pulses that can directly provide the pulse of a circuit pattern. To establish this library, we transform the circuit and apply convolutional operators to extract reusable patterns and precalculate their resultant pulses. Then, we develop a matching algorithm to identify such patterns shared by the target circuit. Experiments show that QPulseLib achieves <inline-formula> <tex-math>$158.46times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$16.03times $ </tex-math></inline-formula> speedup for pulse calculation, compared to Qiskit Pulse and AccQOC. Moreover, we extend the design as a fast and precise all-in-FPGA pulse control approach using near-quantum cache design, SmartQCache. To be specific, we employ a two-level cache to hold reusable pulses of frequently-used circuit patterns. Such a design enables pulse prefetching in near-quantum peripherals, dramatically reducing the end-to-end synthesis latency. To achieve precise pulse control, SmartQCache incorporates duration optimization and pulse sequence calibration to mitigate the execution errors from imperfect hardware, crosstalk, and time shift. Experimental results demonstrate that SmartQCache achieves <inline-formula> <tex-math>$294.37times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$145.43times $ </tex-math></inline-formula> speedup in pulse synthesis compared to Qiskit Pulse and AccQOC. It also reduces the pulse inaccuracy by <inline-formula> <tex-math>$1.27times $ </tex-math></inline-formula> compared to QuMA.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1704-1716"},"PeriodicalIF":2.7,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143870930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vesper: A Versatile Sparse Linear Algebra Accelerator With Configurable Compute Patterns","authors":"Hanchen Jin;Zichao Yue;Zhongyuan Zhao;Yixiao Du;Chenhui Deng;Nitish Srivastava;Zhiru Zhang","doi":"10.1109/TCAD.2024.3496882","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3496882","url":null,"abstract":"Sparse linear algebra (SLA) operations are fundamental building blocks for many important applications, such as data analytics, graph processing, machine learning, and scientific computing. In particular, four compute kernels in SLA are widely used, including sparse-matrix-dense-vector multiplication, sparse-matrix-dense-matrix multiplication, sparse-matrix-sparse-vector multiplication, and sparse-matrix-sparse-matrix multiplication. Recently, an active area of research has emerged to build specialized hardware accelerators for these SLA kernels. However, existing efforts mostly focus on accelerating a single kernel and the proposed accelerator architectures are often limited to a specific compute pattern, such as inner, outer, or row-wise product. This work proposes Vesper, a high-performance and versatile sparse accelerator that supports all four important SLA kernels while being configurable to execute the compute patterns suitable for different kernels under various degrees of sparsity. To enable rapid exploration of the large architectural design and configuration space, we devise an analytical model to estimate the performance of an SLA kernel running on a given hardware configuration using a specific compute pattern. Guided by our model, we build a flexible yet efficient accelerator architecture that maximizes the resource sharing amongst the hardware modules used for different SLA kernels and the associated compute patterns. We evaluate the performance of Vesper using gem5 on a diverse set of matrices from SuiteSparse. Our experiment results show that Vesper achieves a comparable or higher throughput with increased bandwidth efficiency than the state-of-the-art accelerators that are tailor-made for a specific SLA kernel. In addition, we evaluate Vesper on a real-world application called label propagation (LP), an iterative graph-based learning algorithm that involves multiple SLA kernels and exhibits varying degrees of sparsity across iterations. Compared to CPU- and GPU-based executions, Vesper speeds up the LP algorithm by <inline-formula> <tex-math>$12.0times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.7times $ </tex-math></inline-formula>, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1731-1744"},"PeriodicalIF":2.7,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143870929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators","authors":"Xiaotian Sun;Xinyu Wang;Wanqian Li;Yinhe Han;Xiaoming Chen","doi":"10.1109/TCAD.2024.3496847","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3496847","url":null,"abstract":"In the past decade, various processing-in-memory (PIM) accelerators based on various devices, micro-architectures, and interfaces have been proposed to accelerate deep neural networks (DNNs). How to deploy DNNs onto PIM-based accelerators is the key to explore PIM’s high performance and energy efficiency. The scale of DNN models, the diversity of PIM accelerators, and the complexity of deployment are far beyond the human deployment capability. Hence, an automatic deployment methodology is indispensable. In this work, we propose PIMCOMP, an end-to-end DNN compiler tailored for PIM accelerators, achieving efficient deployment of DNN models on PIM hardware. PIMCOMP can adapt to various PIM architectures by using an abstract configurable PIM accelerator template with a set of pseudo instructions, which is a high-level abstraction of the hardware’s fundamental functionalities. Through a generic multilevel optimization framework, PIMCOMP realizes an end-to-end conversion from a high-level DNN description to pseudo instructions, which can be further converted to specific hardware intrinsics/primitives. The compilation addresses two critical issues in PIM-accelerated inference from a system perspective: 1) resource utilization and 2) dataflow scheduling. PIMCOMP adopts a flexible unfolding format to reshape and partition convolutional layers, adopts a weight-layout guided computation-storage-mapping approach to enhance resource utilization, and balances the system’s computation, memory access, and communication characteristics. For dataflow scheduling, we design two scheduling algorithms with different interlayer pipeline granularities to support varying application scenarios while ensuring high-computational parallelism. Experiments demonstrate that PIMCOMP improves throughput, latency, and energy efficiency across various architectures. PIMCOMP is open-sourced at <uri>https://github.com/sunxt99/PIMCOMP-NN</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1745-1759"},"PeriodicalIF":2.7,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAFE: A Scalable Homomorphic Encryption Accelerator for Vertical Federated Learning","authors":"Zhaohui Chen;Zhen Gu;Yanheng Lu;Xuanle Ren;Ruiguang Zhong;Wen-Jie Lu;Jiansong Zhang;Yichi Zhang;Hanghang Wu;Xiaofu Zheng;Heng Liu;Tingqiang Chu;Cheng Hong;Changzheng Wei;Dimin Niu;Yuan Xie","doi":"10.1109/TCAD.2024.3496836","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3496836","url":null,"abstract":"Privacy preservation has become a critical concern for governments, hospitals, and large corporations. Homomorphic encryption (HE) enables a ciphertext-based computation paradigm with strong security guarantees. In emerging cross-agency data cooperation scenarios like vertical federated learning (VFL), HE protects the data interaction from exposure to counterparts. However, computation on ciphertext has significant performance challenges due to increased data size and substantial overhead. Related work has been proposed to accelerate HE using parallel hardware, such as GPUs, FPGAs, and ASICs. However, many existing hardware accelerators target specific HE operations, such as number theoretic transform (NTT) and key switching, providing limited performance improvement for end-to-end applications. Others support bootstrapping, which requires quite a large ASIC design. To better support existing VFL training applications, we propose SAFE, an HE accelerator for scalable homomorphic matrix-vector products (HMVPs), which is the performance bottleneck. SAFE adopts a coefficient-wise encoded HMVP algorithm, despite a vanilla mode, we further explore the compressed and concatenated modes, which can fully utilize the polynomial encoding slots. The proposed hardware architecture, customized for HMVP dataflow, supports spatial and temporal parallelization of function units. The most costly polynomial function, NTT, is implemented with a low-area constant geometry unit which improves efficiency by <inline-formula> <tex-math>$2.43times $ </tex-math></inline-formula>. SAFE is implemented as a CPU-FPGA heterogeneous acceleration system, unleashing the multithread potential. The evaluation demonstrates an up to <inline-formula> <tex-math>$36times $ </tex-math></inline-formula> speed-up in end-to-end federated logistic regression training.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1662-1675"},"PeriodicalIF":2.7,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Formal Verification of Virtualization-Based Trusted Execution Environments","authors":"Hasini Witharana;Hansika Weerasena;Prabhat Mishra","doi":"10.1109/TCAD.2024.3443008","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443008","url":null,"abstract":"Trusted execution environments (TEEs) provide a secure environment for computation, ensuring that the code and data inside the TEE are protected with respect to confidentiality and integrity. Virtual machine (VM)-based TEEs extend this concept by utilizing virtualization technology to create isolated execution spaces that can support a complete operating system or specific applications. As the complexity and importance of VM-based TEEs grow, ensuring their reliability and security through formal verification becomes crucial. However, these technologies often operate without formal assurances of their security properties. Our research introduces a formal framework for representing and verifying VM-based TEEs. This approach provides a rigorous foundation for defining and verifying key security attributes for safeguarding execution environments. To demonstrate the applicability of our verification framework, we conduct an analysis of real-world TEE platforms, including Intel’s trust domain extensions (TDX). This work not only emphasizes the necessity of formal verification in enhancing the security of VM-based TEEs but also provides a systematic approach for evaluating the resilience of these platforms against sophisticated adversarial models.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4262-4273"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems publication information","authors":"","doi":"10.1109/TCAD.2024.3479791","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3479791","url":null,"abstract":"","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"C3-C3"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10745784","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}