{"title":"Neural Evolutionary Architecture Search for Compact Printed Analog Neuromorphic Circuits","authors":"Haibin Zhao;Priyanjana Pal;Michael Hefenbrock;Yuhong Wang;Michael Beigl;Mehdi B. Tahoori","doi":"10.1109/TCAD.2024.3524357","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3524357","url":null,"abstract":"Printed electronics (PE) is an additive fabrication technology that not only allows highly flexible printing of circuit patterns, but also produces soft, nontoxic, and degradable electronics at extremely low cost. These properties make PE an enabler of new application domains, e.g., fast-moving consumer goods and disposable healthcare devices. A particularly promising class of circuits in this technology is printed analog neuromorphic circuits, which offer efficient and highly tailored computational functionality. In this work, we leverage the highly flexible fabrication process of PE to address its main bottleneck, i.e., large feature sizes and low device counts. This issue is crucial, as it impairs the integration of printed circuits into target applications with limited footprint, such as smart band-aids. We propose an evolutionary algorithm (EA) that improves circuit compactness through circuit architecture optimization. As baselines, we compare the proposed EA with a state-of-the-art pruning method and a modified area-aware pruning method. All three methods can optimize the circuit architecture. Simulation results show that the proposed EA produces compact circuits and outperforms the pruning method with $3.1\\times$ lower area and no loss of accuracy. 
As a byproduct, the power is reduced by $3.0\\times$, paving the way to energy-harvested printed systems.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2655-2668"},"PeriodicalIF":2.7,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Myosotis: An Efficiently Pipelined and Parameterized Multiscalar Multiplication Architecture via Data Sharing","authors":"Changxu Liu;Hao Zhou;Lan Yang;Zheng Wu;Patrick Dai;Yinlong Li;Shiyong Wu;Fan Yang","doi":"10.1109/TCAD.2024.3524364","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3524364","url":null,"abstract":"Zero-knowledge proof (ZKP) is a widely used privacy-preserving technology, where multiscalar multiplication (MSM) accounts for over 70% of the computational workload. Accelerating MSM can enhance the overall performance of ZKP, making it a focal point of community attention. However, in practical applications involving the deployment of multiple MSM accelerators, existing designs often overlook strategies for optimizing bandwidth and area efficiency. To address this, we propose Myosotis, an efficiently pipelined and parameterized MSM architecture. By sharing input data and allocating cache effectively, it reduces the average transmission bandwidth at runtime. Myosotis also supports the use of multiple point addition (PADD) units to achieve performance gains, balancing area overhead and latency for improved area efficiency. Different parameter selections enable a tradeoff between the performance, area, and bandwidth of the MSM accelerator. When benchmarked with MSM degrees between $2^{18}$ and $2^{26}$, our proposed baseline design achieves up to $3.32\\times$ and $6.72\\times$ speedups over state-of-the-art FPGA and ASIC designs. Compared to the baseline, Myosotis with two window MSMs and one PADD unit reduces bandwidth demand by 43% while maintaining similar area and latency. 
On the other hand, Myosotis with three window MSMs and two PADD units decreases latency by 43% and bandwidth by 17%, with only a 9% area increase.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2738-2750"},"PeriodicalIF":2.7,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Connectivity-Agnostic Built-In Self-Repair of Interconnects in a Chiplet IC","authors":"Chi Lai;Shi-Yu Huang","doi":"10.1109/TCAD.2024.3524473","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3524473","url":null,"abstract":"In a chiplet IC, several dice are integrated through die-to-die interconnects. Technical challenges still exist for the repair of these die-to-die interconnects to boost the overall manufacturing yield. In this work, we propose a novel Connectivity-Agnostic Built-In Self-Repair (BISR) scheme for chiplet ICs. In our scheme, the design-for-BISR circuit inserted in each functional die except the master die is independent of the die-to-die connectivity so that a nonmaster die can be repeatedly reused in many chiplet ICs while supporting in-the-field repair of faulty interconnects to boost the manufacturing yield and in-the-field reliability.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2451-2460"},"PeriodicalIF":2.7,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of Droplet Routing in Microfluidic Biochips Using Calibrated Droplet-Shape Morphing","authors":"Arun Sankar Eenhakkattu Mana;Navajit Singh Baban;Maolin Zhang;Dongping Wang;Hanbin Ma;Tsung-Yi Ho;Krishnendu Chakrabarty","doi":"10.1109/TCAD.2024.3524475","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3524475","url":null,"abstract":"Advanced digital microfluidic biochips based on technologies such as micro-electrode-dot-array (MEDA) and active matrix (AM) provide enhanced functionality compared to conventional biochips. Owing to the larger ratio of droplet size to electrode size, these platforms allow finer control of droplets and diagonal movement. Additionally, they allow dynamic grouping of micro-electrodes to form subsystems that can perform fluidic operations. Shape morphing is a key feature of MEDA/AM biochips that results in faster fluidic operations, thereby improving the efficiency of bioassays. To establish the benefits of shape morphing, we first numerically simulate the shape morphing operation. We employ a simplified 2-D flow model incorporating interface tracking through a level-set method to numerically simulate shape morphing induced by micro-electrode actuation. We also validate our numerical results with COMSOL simulations and experiments performed on MEDA/AM biochips. The validated shape morphing operations are subsequently used to optimize droplet routing for benchmark bioassays. We propose an algorithm to significantly reduce the size of the routing problem and the time needed to solve it. 
With the help of this improved approach, we show that droplet morphing operations reduce the time needed to complete bioassays.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2683-2696"},"PeriodicalIF":2.7,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-Grained Structured Sparse Computing for FPGA-Based AI Inference","authors":"Chen Zhang;Shijie Cao;Guohao Dai;Chenbo Geng;Zhuliang Yao;Wencong Xiao;Yunxin Liu;Ming Wu;Lintao Zhang;Guangyu Sun;Zhigang Ji;Runsheng Wang;Ru Huang","doi":"10.1109/TCAD.2024.3524356","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3524356","url":null,"abstract":"With the explosive growth in the number of parameters in deep neural networks (DNNs), sparsity-centric algorithm and hardware designs have become critical for low-latency AI serving systems. However, the inherent randomness in pruning methods often leads to fragmented data access and irregular computation patterns in sparse matrices, resulting in significantly reduced hardware efficiency. Addressing the balance between the ‘randomness’ required to maintain model accuracy and the ‘regularity’ needed for efficient hardware design is crucial for realizing effective sparse computing in AI. This article proposes a fine-grained structured sparsity (FSS) paradigm. The pruned sparse matrices in this paradigm exhibit characteristics of ‘local randomness’ and ‘global regularity’. This dual-feature design allows AI accelerator hardware based on the FSS paradigm to maintain both high model accuracy and efficient hardware design. 
We implemented this novel accelerator on the Xilinx Alveo U280 and validated our concept across three different AI models, including CNN, RNN, and LLM, demonstrating performance that significantly outperforms prior methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2544-2557"},"PeriodicalIF":2.7,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Delay-Driven Iterative Technology Mapping Framework","authors":"Junfeng Liu;Liwei Ni;Lei Chen;Xing Li;Qinghua Zhao;Xingquan Li;Shuai Ma","doi":"10.1109/TCAD.2024.3524463","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3524463","url":null,"abstract":"Technology mapping is the pivotal synthesis step that translates abstract logical models into technology-dependent implementations using the designated library, e.g., standard cells for ASICs. Efficient solutions rely heavily on gate selection guided by estimated delay. However, estimating these delays is difficult due to the absence of actual interconnect load and transition time during mapping. In this article, we revisit the difficulties of the delay-driven mapping problem and explore three key insights to address them. Inspired by these insights, we first design a structure-aware load-slew model that integrates input transitions and output loads for gate delay estimation. Building on this model, we propose a delay-iterative framework that progressively reduces the overall circuit delay by further aligning library characteristics with logical network structures. Finally, experiments with 130 nm and 7 nm libraries show that, compared with ABC, the framework reduces circuit delay on average by 10% under the nonlinear delay model and by 6% after P&R.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2585-2598"},"PeriodicalIF":2.7,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Analysis of Optimization Method for Ultra-Wideband PA Based on Improved MOEA/D Algorithm Using Mixed Objective Function","authors":"Zhongpeng Ni;Jing Xia;Xinyu Zhou;Wa Kong;Heng Zhang;Chao Yu;Xiao-Wei Zhu","doi":"10.1109/TCAD.2024.3524333","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3524333","url":null,"abstract":"This article proposes a design and optimization method for ultra-wideband power amplifiers (PAs) using an improved multiobjective evolutionary algorithm based on decomposition (MOEA/D) and a mixed optimization objective function (MOOF). To address the insufficient optimization capability of the conventional MOEA/D algorithm when dealing with complex Pareto fronts, the algorithm is improved by using adaptive weights, adaptive neighborhoods, and global replacement. First, based on the sparsity of the population and an external population, the weight vectors corresponding to invalid or crowded individuals in the population are replaced. The neighborhoods are then adaptively adjusted based on the number of iterations and the sparsity of each individual. Finally, a global replacement is employed to accelerate convergence. Moreover, a MOOF combining the load impedance, output power, efficiency, and gain is used in the PA optimization. To reduce the difficulty of impedance judgment, a method for constructing the impedance solution set based on Poisson disk sampling is proposed and employed for optimization. For validation, an ultra-wideband PA operating at 100–4000 MHz (190.2% fractional bandwidth) was designed and optimized. 
Measured results show a drain efficiency ranging from 56.0% to 70.9% with an output power higher than 39 dBm at saturation within the whole operational frequency band.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2641-2654"},"PeriodicalIF":2.7,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
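The 190.2% figure quoted in the abstract above can be checked by hand. The following is a sketch that assumes the usual definition of fractional bandwidth relative to the band's center frequency; the abstract does not state which definition the authors use, and the symbols $f_L$ and $f_H$ (lower and upper band edges, in MHz) are introduced here only for illustration.

```latex
\mathrm{FBW} = \frac{2\,(f_H - f_L)}{f_H + f_L}
             = \frac{2\,(4000 - 100)}{4000 + 100}
             = \frac{7800}{4100} \approx 1.902 = 190.2\%
```

Under this definition the value approaches 200% as $f_L$ goes to zero, so 190.2% corresponds to a 40:1 frequency ratio.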
{"title":"GCN-ABFT: Low-Cost Online Error Checking for Graph Convolutional Networks","authors":"Christodoulos Peltekis;Giorgos Dimitrakopoulos","doi":"10.1109/TCAD.2024.3523425","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3523425","url":null,"abstract":"Graph convolutional networks (GCNs) are popular for building machine-learning applications for graph-structured data. This widespread adoption has led to the development of specialized GCN hardware accelerators. In this work, we address a key architectural challenge for GCN accelerators: how to detect errors in GCN computations arising from random hardware faults with the least computation cost. Each GCN layer performs a graph convolution, mathematically equivalent to multiplying three matrices, computed through two separate matrix multiplications. Existing algorithm-based fault tolerance (ABFT) techniques can check the results of individual matrix multiplications. However, for a GCN layer, this check must be performed twice. To avoid this overhead, this work introduces GCN-ABFT, which directly calculates a checksum for the entire three-matrix product within a single GCN layer, providing a cost-effective approach for error detection in GCN accelerators. Experimental results demonstrate that GCN-ABFT reduces the number of operations needed for checksum computation by over 21% on average for representative GCN applications. 
These savings are achieved without sacrificing fault-detection accuracy, as evidenced by the presented fault-injection analysis.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2836-2840"},"PeriodicalIF":2.7,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
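The idea of checking a three-matrix product with a single checksum, as described in the abstract above, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the layer output is `A · H · W` (before any activation), uses the classic ABFT identity that the sum of all entries of a product equals the column sums of the left factor contracted with the row sums of the right factor, and all function and variable names are hypothetical. Small integer matrices are used so that the comparison is exact.

```python
# Sketch of a fused checksum for a three-matrix product A @ H @ W.
# Assumption: one predicted value (e^T A) H (W e) is compared with the sum of
# all output entries, instead of checking each of the two matmuls separately.

def matmul(p, q):
    """Plain matrix product of two lists of lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))]
            for i in range(len(p))]

def column_sums(m):
    """e^T M: one sum per column."""
    return [sum(row[j] for row in m) for j in range(len(m[0]))]

def row_sums(m):
    """M e: one sum per row."""
    return [sum(row) for row in m]

def predicted_checksum(a, h, w):
    """(e^T A) H (W e), computed without forming A @ H or H @ W."""
    left = column_sums(a)   # one entry per row of H
    right = row_sums(w)     # one entry per column of H
    return sum(left[i] * h[i][j] * right[j]
               for i in range(len(h)) for j in range(len(h[0])))

def actual_checksum(out):
    """Sum of every entry of the computed layer output."""
    return sum(sum(row) for row in out)

# Toy layer: 3-node adjacency (with self-loops), 2 input and 3 output features.
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
H = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 1]]

out = matmul(matmul(A, H), W)            # the two matmuls of the layer
fault_free = actual_checksum(out) == predicted_checksum(A, H, W)

faulty = [row[:] for row in out]
faulty[1][2] += 1                        # inject a single-entry error
fault_detected = actual_checksum(faulty) != predicted_checksum(A, H, W)
```

In this sketch `fault_free` and `fault_detected` both evaluate to `True`. A real accelerator would work with floating-point or fixed-point data and a tolerance threshold, which this example leaves out.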
{"title":"SR-BIP: A Soft Error-Resilient Binary Neural Network Inference Processor","authors":"Gil-Ho Kwak;Jaeho Kim;Tae-Hwan Kim","doi":"10.1109/TCAD.2024.3523424","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3523424","url":null,"abstract":"This brief presents an efficient binary neural network inference processor that is resilient to soft errors caused by potential circuit faults. The proposed processor, SR-BIP, achieves error resilience through a recompute-based error correction technique. To minimize overhead, the recompute is performed selectively by exploiting the spatial locality inherent in a feature map. For the CIFAR10 task at 0.1% bit error rate, SR-BIP achieves 84.42% accuracy, which is 17.59% higher than that without any error resilience. Despite this error resilience, SR-BIP exhibits a resource efficiency of 75.27 MOP/s/LUT in a 28-nm FPGA, which is as high as that of the previous state-of-the-art processor designed without considering error resilience.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2822-2826"},"PeriodicalIF":2.7,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Unified Deep Reinforcement Learning Approach for Constructing Rectilinear and Octilinear Steiner Minimum Tree","authors":"Zhenkun Lin;Genggeng Liu;Xing Huang;Yibo Lin;Jixin Zhang;Wen-Hao Liu;Ting-Chi Wang","doi":"10.1109/TCAD.2024.3523429","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3523429","url":null,"abstract":"The Steiner minimum tree (SMT) serves as an optimal connection model for multiterminal nets in very large scale integration (VLSI). Constructing a rectilinear SMT (RSMT) and constructing an octilinear SMT (OSMT) are both known to be NP-hard problems. At the same time, constructing multiple SMT topologies for a given net is important for meeting routing constraints, such as alleviating congestion and ensuring timing convergence. However, existing efforts predominantly focus on designing specialized methods to construct a specifically structured SMT for a given net, which makes them difficult to extend to different SMT structures or topologies and also limits their optimization capability. In this work, we propose a unified approach based on deep reinforcement learning (DRL) to address both RSMT and OSMT problems while generating diverse routing topologies. First, we design an edge point sequence (EPS) that leverages the structural characteristics of SMT to connect the output of the deep learning model with the SMT structure. Second, we propose a deep learning model tailored for EPS, employing the negative wirelength of SMT as a reward to train the model using DRL. Third, we provide a corresponding rapid and accurate wirelength computation algorithm for evaluating the quality of a constructed solution, which expedites model training. Finally, we leverage the stochastic nature of machine learning to construct diverse SMT solutions. To the best of our knowledge, this is the first unified approach capable of simultaneously addressing both RSMT and OSMT problems while generating diverse solutions. 
The proposed unified approach demonstrates superior solution quality and higher efficiency compared to specifically designed algorithms.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2711-2724"},"PeriodicalIF":2.7,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}