{"title":"Delay-Driven Rectilinear Steiner Tree Construction","authors":"Hongxi Wu;Xingquan Li;Liang Chen;Bei Yu;Wenxing Zhu","doi":"10.1109/TCAD.2024.3501932","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3501932","url":null,"abstract":"Timing-driven routing is crucial in complex circuit design. Existing shallow-light Steiner tree construction methods balance between wire length (WL) and source-sink path length (PL) but lack in delay. Conversely, previous delay-driven methods prioritize delay but result in longer WL and PL, making them suboptimal. In this article, we show that simultaneously reducing the WL and PL can effectively reduce the delay. Furthermore, we investigate how delay changes during the reduction of PL. Guided by the theoretical findings, we develop a rectilinear shallow-light Steiner tree construction algorithm designed to reduce delay meanwhile maintaining a bounded WL. Furthermore, a delay-driven edge shifting algorithm is proposed to fine tune the tree’s topology, further reducing delay. We show that our proposed edge shifting algorithm can return a local Pareto optimal solution when repeatedly applied. Experimental results show that our algorithm achieves the lowest total delay compared to previous methods while maintaining competitive WL. Moreover, for nets with pins that have timing information, our algorithm can generate the most suitable Steiner Tree based on the timing information. In addition, extended experiments highlight the positive impact of constructing rectilinear Steiner trees with minimized total delay. Our codes will be available at <uri>https://github.com/Whx97/Delay-driven-Steiner-Tree</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1928-1941"},"PeriodicalIF":2.7,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143860818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RuleLearner: OPC Rule Extraction From Inverse Lithography Technique Engine","authors":"Ziyang Yu;Su Zheng;Wenqian Zhao;Shuo Yin;Xiaoxiao Liang;Guojin Chen;Yuzhe Ma;Bei Yu;Martin D. F. Wong","doi":"10.1109/TCAD.2024.3499909","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499909","url":null,"abstract":"Model-based optical proximity correction (OPC) with subresolution assist feature (SRAF) generation is a critical standard practice for compensating lithography distortions in the fabrication of integrated circuits at advanced technology nodes. Typical model-based OPC and SRAF algorithms involve the selection of user-controlled rule parameters. Conventionally, these rules are heuristically determined and applied globally throughout the correction regions, which can be time consuming and require expert knowledge of the tool. Additionally, the correlations of rule parameters to the objectives are highly nonlinear. All these factors make designing a high-performance OPC engine for complex metal designs a nontrivial task. This article proposes RuleLearner, a comprehensive mask optimization system designed for SRAF generation and model-based OPC in real industrial scenarios. The proposed framework learns from the guidance of an information-augmented inverse lithography technique engine, which, although expressive for complex designs, is expensive to generate refined masks for a whole set of design clips. Considering the nonlinearity and the tradeoff between local and global performance, the extracted rule value distributions are further optimized with customized natural gradients. The sophisticated SRAF generation, the edge segmentation and movements are then guided by the rule parameter. Experimental results show that RuleLearner can be applied across different complex design patterns and achieve the best lithographic performance and computational efficiency.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1915-1927"},"PeriodicalIF":2.7,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143860952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct Search Procedure for Functional Compaction With Improved Fault Coverage","authors":"Irith Pomeranz","doi":"10.1109/TCAD.2024.3499898","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499898","url":null,"abstract":"An important component of ensuring system reliability is the application of functional tests. Functional test sequences are available after simulation-based design verification, they can be extracted from application programs, or generated for target faults. Functional test sequences can be long, and test compaction at the gate-level is important for reducing the test application time without losing fault coverage. Experimental results with several test compaction procedures indicate that test compaction sometimes leads accidentally to an increased fault coverage. Such an increase was observed recently with a gate-level test compaction procedure that has the unique property of restoring functional operation conditions after parts of a sequence are eliminated. The contribution of this article is to use this property of the test compaction procedure to increase the fault coverage directly, in a targeted manner, while compacting the sequence. Experimental results for benchmark circuits in an academic environment demonstrate a significant fault coverage increase combined with significant test compaction.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1981-1990"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143860816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Built-In Self-Repair With Maximum Fault Collection and Fast Analysis Method for HBM","authors":"Joonsik Yoon;Hayoung Lee;Youngki Moon;Seung Ho Shin;Sungho Kang","doi":"10.1109/TCAD.2024.3499903","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499903","url":null,"abstract":"High bandwidth memory (HBM) represents a significant advancement in memory technology, requiring quick and accurate data processing. Built-in self-repair (BISR) is crucial for ensuring high-capacity and reliable memories, as it automatically detects and repairs faults within memory systems, preventing data loss and enhancing overall memory reliability. The proposed BISR aims to enhance the repair rate and reliability by using a content-addressable memory structure that operates effectively in both offline and online modes. Furthermore, a new redundancy analysis algorithm reduces both analysis time and area overhead by converting fault information into a matrix format and focusing on fault-free areas for each repair solution. Experimental results demonstrate that the proposed BISR improves repair rates and derives a final repair solution immediately after the test sequences are completed. Moreover, hardware comparisons have shown that the proposed approach reduces the area overhead as memory size increases. Consequently, the proposed BISR enhances the overall performance of BISR and the reliability of HBM.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"2014-2025"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143860819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COCO: Configuration-Based Compaction of a Compressed Topped-Off Test Set","authors":"Irith Pomeranz","doi":"10.1109/TCAD.2024.3499907","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499907","url":null,"abstract":"Comprehensive defect coverage requires test sets that detect faults from several fault models. A test set is typically topped-off to detect faults from an additional fault model that are not already detected. This creates large test sets whose last tests detect small numbers of additional faults. Reducing the storage requirements of topped-off test sets (or test sets for fault models with large numbers of faults) is the topic of this article. Instead of storing the last tests in their entirety, it was shown previously that it is possible to produce the last tests of the test set from tests that appear earlier by complementing single bits. The storage requirements are reduced when only complemented bits are stored; however, the number of applied tests is increased. This article observes that changing the configuration by which decompressed test data are shifted into scan chains produces new tests that are effective in replacing tests at the end of a topped-off test set without increasing the number of applied tests. This approach is developed in this article in an academic environment and implemented using academic software tools. It is applied to benchmark circuits to demonstrate its effectiveness.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1991-1999"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143860757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging the Gap From Vague Design Requirements to Feasible Structure: Deep Learning Model for Parameterized MEMS Sensor Design","authors":"Xiong Cheng;Pengfei Zhang;Yiqi Zhou;Rui Wang;Zhixiang Zhai;Youyou Fan;Wenhua Gu;Xiaodong Huang;Daying Sun","doi":"10.1109/TCAD.2024.3499897","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499897","url":null,"abstract":"The design of MEMS sensor presents a significant challenge in identifying feasible structures that align with specific performance criteria. Traditionally, this process demands extensive design expertise and iterative simulations, leading to time-intensive workflows. While recent advancements have introduced deep learning (DL) models to expedite this process, they are limited to handling simple scenarios with precise performance values and fixed dimensions as inputs, often overlooking the uncertainty inherent in real design scenarios, such as vague range requirements and variable input dimensions. To address this issue, this study introduces a novel DL-based design model along with corresponding modeling strategies. The proposed model consists of a search network (SN), a validation network (VN), and a precision optimizer (PO). Initially, design requirements of various types and dimensions are transformed into a standardized input vector to address diverse design scenarios, which is then processed by the SN to generate a feasible structure. The VN, trained prior to the SN, validates the structure and generates training data for the SN. In cases where the model output fails to sufficiently align with the requirements, the PO is deployed to minimize the design error. Validation of the proposed model was conducted using a piezoresistive acceleration sensor across 100000 distinct design requirements. The results demonstrate an overall design accuracy (DA) of 92.64% on the testing data. Following 1000 iterations leveraging the proposed PO, the DA improves to 93.84%. Notably, each design iteration and optimization using the PO only requires approximately 0.1 ms, significantly boosting the design efficiency of MEMS sensors.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1845-1855"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143870970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Noninvasive Methodology for the Age Estimation of ICs Using Gaussian Process Regression","authors":"Anmol Singh Narwariya;Pabitra Das;Saqib Khursheed;Amit Acharyya","doi":"10.1109/TCAD.2024.3499893","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3499893","url":null,"abstract":"Age prediction for integrated circuits (ICs) is essential in establishing prevention and mitigation steps to avoid unexpected circuit failures in the field. Any electronic system would get benefit from an accurate age calculation. Additionally, it would assist in reducing the amount of electronic waste and the effort toward green computing. In this article, we propose a methodology to estimate the age of ICs using the Gaussian process regression (GPR). The output frequency of the ring oscillator (RO) is influenced by various factors, including the trackable path, voltage, temperature, and ageing. These dependencies are leveraged in the GPR model training. We demonstrate the RO’s frequency degradation by employing the Synopsys HSPICE tool with 32 nm predictive technology model (PTM) and the Synopsys technology library. We used temperature variation from 0 °C to 100 °C and voltage variation from 0.80 to 1.05 V for the data acquisition. Our methodology predicts age precisely; the minimum prediction accuracy with a month deviation on linear sampling rate is 85.36% for 13-Stage RO and 87.09% for 21-Stage RO, with a range of improvement in prediction accuracy compared to state-of-the-art (SOTA) is 9.74% to 16.99%. Similarly, on the logarithmic sampling rate, the prediction accuracy for 13-Stage RO and 21-Stage RO are 98.62% and 98.56%, respectively. The proposed methodology performs more accurately in terms of prediction accuracy and age prediction deviation from the SOTA methodology.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1833-1844"},"PeriodicalIF":2.7,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SmartQCache: Fast and Precise Pulse Control With Near-Quantum Cache Design on FPGA","authors":"Liqiang Lu;Wuwei Tian;Xinghui Jia;Zixuan Song;Siwei Tan;Jianwei Yin","doi":"10.1109/TCAD.2024.3497839","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3497839","url":null,"abstract":"Quantum pulse serves as the machine language of superconducting quantum devices, which needs to be synthesized and calibrated for precise control of quantum operations. However, existing pulse control systems suffer from the dilemma between long synthesis latency and inaccuracy of quantum control systems. compute-in-CPU synthesis frameworks, like IBM Qiskit Pulse, involve massive redundant computation during pulse calculation, suffering from a high computational cost when handling large-scale circuits. On the other hand, field-programmable gate array (FPGA)-based synthesis frameworks, like QuMA, faces inaccurate pulse control problem. In this article, we propose both compute-in-CPU and all-in-FPGA solutions to collaboratively solve the latency and inaccuracy problem. First, we propose QPulseLib, a novel compute-in-CPU library with reusable pulses that can directly provide the pulse of a circuit pattern. To establish this library, we transform the circuit and apply convolutional operators to extract reusable patterns and precalculate their resultant pulses. Then, we develop a matching algorithm to identify such patterns shared by the target circuit. Experiments show that QPulseLib achieves <inline-formula> <tex-math>$158.46times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$16.03times $ </tex-math></inline-formula> speedup for pulse calculation, compared to Qiskit Pulse and AccQOC. Moreover, we extend the design as a fast and precise all-in-FPGA pulse control approach using near-quantum cache design, SmartQCache. To be specific, we employ a two-level cache to hold reusable pulses of frequently-used circuit patterns. Such a design enables pulse prefetching in near-quantum peripherals, dramatically reducing the end-to-end synthesis latency. To achieve precise pulse control, SmartQCache incorporates duration optimization and pulse sequence calibration to mitigate the execution errors from imperfect hardware, crosstalk, and time shift. Experimental results demonstrate that SmartQCache achieves <inline-formula> <tex-math>$294.37times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$145.43times $ </tex-math></inline-formula> speedup in pulse synthesis compared to Qiskit Pulse and AccQOC. It also reduces the pulse inaccuracy by <inline-formula> <tex-math>$1.27times $ </tex-math></inline-formula> compared to QuMA.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1704-1716"},"PeriodicalIF":2.7,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143870930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vesper: A Versatile Sparse Linear Algebra Accelerator With Configurable Compute Patterns","authors":"Hanchen Jin;Zichao Yue;Zhongyuan Zhao;Yixiao Du;Chenhui Deng;Nitish Srivastava;Zhiru Zhang","doi":"10.1109/TCAD.2024.3496882","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3496882","url":null,"abstract":"Sparse linear algebra (SLA) operations are fundamental building blocks for many important applications, such as data analytics, graph processing, machine learning, and scientific computing. In particular, four compute kernels in SLA are widely used, including sparse-matrix-dense-vector multiplication, sparse-matrix-dense-matrix multiplication, sparse-matrix-sparse-vector multiplication, and sparse-matrix-sparse-matrix multiplication. Recently, an active area of research has emerged to build specialized hardware accelerators for these SLA kernels. However, existing efforts mostly focus on accelerating a single kernel and the proposed accelerator architectures are often limited to a specific compute pattern, such as inner, outer, or row-wise product. This work proposes Vesper, a high-performance and versatile sparse accelerator that supports all four important SLA kernels while being configurable to execute the compute patterns suitable for different kernels under various degrees of sparsity. To enable rapid exploration of the large architectural design and configuration space, we devise an analytical model to estimate the performance of an SLA kernel running on a given hardware configuration using a specific compute pattern. Guided by our model, we build a flexible yet efficient accelerator architecture that maximizes the resource sharing amongst the hardware modules used for different SLA kernels and the associated compute patterns. We evaluate the performance of Vesper using gem5 on a diverse set of matrices from SuiteSparse. Our experiment results show that Vesper achieves a comparable or higher throughput with increased bandwidth efficiency than the state-of-the-art accelerators that are tailor-made for a specific SLA kernel. In addition, we evaluate Vesper on a real-world application called label propagation (LP), an iterative graph-based learning algorithm that involves multiple SLA kernels and exhibits varying degrees of sparsity across iterations. Compared to CPU- and GPU-based executions, Vesper speeds up the LP algorithm by <inline-formula> <tex-math>$12.0times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.7times $ </tex-math></inline-formula>, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1731-1744"},"PeriodicalIF":2.7,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143870929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators","authors":"Xiaotian Sun;Xinyu Wang;Wanqian Li;Yinhe Han;Xiaoming Chen","doi":"10.1109/TCAD.2024.3496847","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3496847","url":null,"abstract":"In the past decade, various processing-in-memory (PIM) accelerators based on various devices, micro-architectures, and interfaces have been proposed to accelerate deep neural networks (DNNs). How to deploy DNNs onto PIM-based accelerators is the key to explore PIM’s high performance and energy efficiency. The scale of DNN models, the diversity of PIM accelerators, and the complexity of deployment are far beyond the human deployment capability. Hence, an automatic deployment methodology is indispensable. In this work, we propose PIMCOMP, an end-to-end DNN compiler tailored for PIM accelerators, achieving efficient deployment of DNN models on PIM hardware. PIMCOMP can adapt to various PIM architectures by using an abstract configurable PIM accelerator template with a set of pseudo instructions, which is a high-level abstraction of the hardware’s fundamental functionalities. Through a generic multilevel optimization framework, PIMCOMP realizes an end-to-end conversion from a high-level DNN description to pseudo instructions, which can be further converted to specific hardware intrinsics/primitives. The compilation addresses two critical issues in PIM-accelerated inference from a system perspective: 1) resource utilization and 2) dataflow scheduling. PIMCOMP adopts a flexible unfolding format to reshape and partition convolutional layers, adopts a weight-layout guided computation-storage-mapping approach to enhance resource utilization, and balances the system’s computation, memory access, and communication characteristics. For dataflow scheduling, we design two scheduling algorithms with different interlayer pipeline granularities to support varying application scenarios while ensuring high-computational parallelism. Experiments demonstrate that PIMCOMP improves throughput, latency, and energy efficiency across various architectures. PIMCOMP is open-sourced at <uri>https://github.com/sunxt99/PIMCOMP-NN</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1745-1759"},"PeriodicalIF":2.7,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}