{"title":"HALTRAV: Design of a High-Performance and Area-Efficient Latch With Triple-Node-Upset Recovery and Algorithm-Based Verifications","authors":"Xing Guo;Jiajia Zhang;Xu Meng;Zhenmin Li;Xiaoqing Wen;Patrick Girard;Bin Liang;Aibin Yan","doi":"10.1109/TCAD.2024.3511335","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3511335","url":null,"abstract":"With the rapid advancement of semiconductor technologies, latches are becoming increasingly sensitive to soft errors, especially triple node upsets (TNUs), in harsh radiation environments. In this article, we first propose a high-performance and area-efficient latch, namely, HALTRAV, featuring complete TNU-recovery. The storage portion of HALTRAV consists of 28 interlocked source-drain cross-coupled inverters (SCIs) for complete TNU-recovery with area efficiency and low delay. To mitigate the issue that node-upset-recovery verification for existing latches relies heavily on electronic design automation tools, we further propose an algorithm-based verification method that can automatically verify the node-upset-recovery of latches, which greatly simplifies the reliability-verification flow. Simulation results demonstrate the TNU-recovery of HALTRAV and also show that HALTRAV achieves 40.38%, 8.17%, and 31.89% reductions in delay, area, and delay-power-area product (DPAP) on average, respectively, compared with typical TNU-recoverable latches; however, this comes at the cost of power. Comparison results also demonstrate the moderate sensitivity of HALTRAV to the impacts of process, voltage, and temperature (PVT) variations.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2367-2377"},"PeriodicalIF":2.7,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144100072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Recursive Partition-Based In-Memory SIMD Computation Scheduler for Memory Footprint Minimization","authors":"Xingyue Qian;Chenyang Lv;Zhezhi He;Weikang Qian","doi":"10.1109/TCAD.2024.3511337","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3511337","url":null,"abstract":"In-memory computing (IMC) is a technique that enables memory to perform computation so that data transfer between processor and memory can be reduced, improving energy efficiency. A popular IMC design style is based on the single-instruction-multiple-data (SIMD) concept. The SIMD IMC can implement a high-level function in two steps: 1) synthesis and 2) scheduling. The former converts the high-level function into a netlist of the supported primitive logic operations, while the latter determines the execution sequence of the operations. To fully exploit the advantage of SIMD IMC, it is crucial to find a schedule for the given netlist with low memory usage, known as memory footprint (MF). In this work, we first propose an optimal scheduler that can minimize the MF for small netlists. It is at least <inline-formula><tex-math>$8\\times$</tex-math></inline-formula> faster than the state-of-the-art optimal method. For large netlists, we propose a recursive partition-based scheduler consisting of a scheduling-friendly bipartition algorithm and our optimal scheduler. Compared to four state-of-the-art heuristic methods, ours reduces the MF by 54.7%, 48.9%, 44.0%, and 25.5%, respectively, under the same runtime. Our experiments also demonstrate that our scheduler achieves good end-to-end performance when applied to various IMC architectures. The code of our scheduler is open-source.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2105-2118"},"PeriodicalIF":2.7,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Accurate Minifloat MACCs for Neural Network Inference on Versal FPGAs","authors":"Hans Jakob Damsgaard;Konstantin J. Hoßfeld;Jari Nurmi;Thomas B. Preußer","doi":"10.1109/TCAD.2024.3511343","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3511343","url":null,"abstract":"Machine learning (ML) is ubiquitous in contemporary applications. Its need for efficient acceleration has driven vast research efforts into the quantization of neural networks with low-precision numerical formats. Models quantized with minifloat formats of eight or fewer bits have proven capable of outperforming models quantized into same-size integers. However, unlike integers, minifloats require accurate accumulation to prevent the introduction of rounding errors. We explore the design space of parallel accurate minifloat multiply-accumulators (MACCs) targeting the AMD Versal™ FPGA fabric. We experiment with three variations of the multiply-and-shift and adder tree components of a minifloat MACC. For comparison, we apply similar alterations to a parallel integer MACC. Our results show that custom compressor trees with external sign-inversion gates reduce the mean area of the minifloat MACCs by 17.7% and increase their clock frequency by 16.2%. In comparison, custom compressor trees with absorbed partial product generation gates reduce the mean area of integer MACCs by 28.1% and increase their clock frequency by 3.60%. Comparing the best-performing designs, we observe that minifloat MACCs consume 20% to 180% more resources than integer ones with same-size operands without accounting for a conversion back into a floating-point format, and 60% to 300% more resources when including it. Our data enable engineers to make informed decisions in their designs of deeply integrated embedded ML solutions when trading off training and fine-tuning effort versus resource cost.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2181-2194"},"PeriodicalIF":2.7,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10777058","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PauliForest: Connectivity-Aware Synthesis and Pauli-Oriented Qubit Mapping for Near-Term Quantum Simulation","authors":"Yongshang Li;Yu Zhang;Haoning Deng;Mingyu Chen;Zhenyu Li","doi":"10.1109/TCAD.2024.3509794","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3509794","url":null,"abstract":"Quantum simulation is the foundation for the design of many algorithms which share subroutines known as quantum simulation kernels. Optimizing the compilation of these kernels is crucial, involving two key components: 1) circuit synthesis and 2) qubit mapping. However, existing circuit synthesis methods either overlook qubit connectivity constraints (QCCs) or prioritize minimizing gate count over optimizing circuit depth. Similarly, current qubit mapping techniques do not work well with circuit synthesis methods. To address these limitations, we propose PauliForest, which comprises a connectivity-aware circuit synthesis algorithm and a Pauli-oriented qubit mapping algorithm. The synthesis algorithm employs heuristic strategies to generate shallower circuits, while the qubit mapping algorithm seamlessly collaborates with the circuit synthesis process. Compared to the state-of-the-art Paulihedral compiler, our approach significantly reduces both CNOT gate counts (by 13%) and circuit depths (by 25%). Experiments on a noisy simulator and a real superconducting quantum computer show that our algorithm can improve the fidelity of quantum circuit execution compared to Paulihedral.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2119-2129"},"PeriodicalIF":2.7,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Resubstitution-Based Approximate Logic Synthesis","authors":"Chang Meng;Alan Mishchenko;Weikang Qian;Giovanni De Micheli","doi":"10.1109/TCAD.2024.3510513","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3510513","url":null,"abstract":"Approximate computing is an emerging paradigm for designing error-resilient applications. It reduces circuit area, power, and delay at the cost of introducing errors. This article proposes a powerful technique, termed approximate resubstitution (AppResub), to approximately simplify the circuit. AppResub replaces a node’s function with a simpler approximate function on existing nodes in the circuit to reduce the hardware cost. Leveraging AppResub, an efficient flow for approximate logic synthesis (ALS) is developed by iteratively applying a set of promising AppResubs for circuit simplification. To evaluate errors caused by a set of AppResubs, a novel error model capable of efficiently computing an error upper bound is used to smartly apply AppResubs in the ALS flow. The experimental results demonstrate that, compared to a state-of-the-art method, the proposed flow reduces area by a further 20.9% and delay by a further 21.7% under the mean error distance constraint, while being <inline-formula><tex-math>$400\\times$</tex-math></inline-formula> faster. The code of our flow is open-source.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2040-2053"},"PeriodicalIF":2.7,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PDNNet: PDN-Aware GNN-CNN Heterogeneous Network for Dynamic IR Drop Prediction","authors":"Yuxiang Zhao;Zhuomin Chai;Xun Jiang;Yibo Lin;Runsheng Wang;Ru Huang","doi":"10.1109/TCAD.2024.3509796","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3509796","url":null,"abstract":"IR drop on the power delivery network (PDN) is closely related to the PDN’s configuration and cell current consumption. As integrated circuit (IC) designs grow larger, dynamic IR drop simulation becomes computationally unaffordable and machine learning-based IR drop prediction has been explored as a promising solution. Although convolutional neural network (CNN)-based methods have been adapted to the IR drop prediction task in several works, the shortcoming of overlooking PDN configuration is non-negligible. In this article, we consider not only how to properly represent the cell-PDN relation, but also how to model IR drop following its physical nature in the feature aggregation procedure. Thus, we propose a novel graph structure, PDNGraph, to unify the representations of the PDN structure and the fine-grained cell-PDN relation. We further propose a dual-branch heterogeneous network, PDNNet, incorporating two parallel GNN-CNN branches to favorably capture the above features during the learning process. Several key designs are presented to make the dynamic IR drop prediction highly effective and interpretable. This is the first work to apply a graph structure to deep-learning-based dynamic IR drop prediction. Experiments show that PDNNet outperforms the state-of-the-art CNN-based methods and achieves a <inline-formula><tex-math>$545\\times$</tex-math></inline-formula> speedup compared to the commercial tool, which demonstrates the superiority of our method.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2253-2263"},"PeriodicalIF":2.7,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonintrusive Data-Driven Model Order Reduction for Circuits Based on Hammerstein Architectures","authors":"Joshua Hanson;Paul Kuberry;Biliana Paskaleva;Pavel Bochev","doi":"10.1109/TCAD.2024.3509797","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3509797","url":null,"abstract":"We demonstrate that system identification techniques can provide a basis for effective, nonintrusive model order reduction (MOR) for common circuits that are key building blocks in microelectronics. Our approach is motivated by the practical operation of these circuits and utilizes a canonical Hammerstein architecture. To demonstrate the approach, we develop parsimonious Hammerstein models for a nonlinear CMOS differential amplifier and an operational amplifier circuit. We train these models on a combination of direct current (DC) and transient SPICE circuit simulation data using a novel sequential strategy to identify their static nonlinear and linear dynamical parts. Simulation results show that the Hammerstein model is an effective surrogate for these types of circuits that accurately and efficiently reproduces their behavior over a wide range of operating points and input frequencies.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2314-2327"},"PeriodicalIF":2.7,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144100073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Model Checking of SystemVerilog-Specified Asynchronous Circuits for Deadlock Detection","authors":"Longlong Lu;Minxue Pan;Yifei Lu;Xuandong Li","doi":"10.1109/TCAD.2024.3509798","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3509798","url":null,"abstract":"Specifying channel-based asynchronous circuits in SystemVerilog is a promising alternative design paradigm that combines the advantages of asynchronous circuits with industrial electronic design automation support. However, communicating through channels can be error-prone, potentially introducing deadlocks that cannot be detected easily through simulation. In contrast, model checking can reliably identify deadlocks, but faces challenges related to scalability and modeling capability. This research proposes a novel model checking approach, named Verilock, to detect deadlocks of channel-based asynchronous circuits specified in SystemVerilog. To address the issue of modeling capability, Verilock extracts intermodule communication behavior from SystemVerilog circuit designs and builds models in communication protocols specifically designed for this purpose. Additionally, Verilock employs a novel hierarchical model checking algorithm that conducts localized verification of well-formed groups of the system from the bottom up, thus reducing the size of the checking problems and presenting the opportunity to parallelize the checking process. Extensive experimental evaluations confirm the efficiency of Verilock on publicly accessible and randomly synthesized large-scale asynchronous circuits. Remarkably, significant benefits of the hierarchical checking approach are demonstrated through an ablative experiment.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2424-2437"},"PeriodicalIF":2.7,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144100062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Design for Multiorgan-on-Chip Geometries","authors":"Maria Emmerich;Philipp Ebner;Robert Wille","doi":"10.1109/TCAD.2024.3509795","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3509795","url":null,"abstract":"Multiorgans-on-chips (multi-OoCs) represent human or other animal physiology on a chip, providing testing platforms for the pharmaceutical, cosmetic, and chemical industries. They are composed of miniaturized organ tissues (so-called organ modules) that are connected via a microfluidic channel network and, by this, represent organ functionalities and their interactions on-chip. The design of these multi-OoC geometries, however, requires a sophisticated orchestration of numerous aspects, such as the size of organ modules, the required shear stress on membranes and subsequently the flow rate, the dimensions and geometry of channels, pump pressures, etc. Mastering all this constitutes a nontrivial design task for which, unfortunately, no automatic support exists yet. In this work, we propose a design automation solution for multi-OoC geometries. To this end, we review the respective design steps and derive a corresponding formal design specification from them. Based on that, we then propose an automatic design tool, which generates a design of the desired device and exports it in a fashion that is ready for subsequent simulation or fabrication. The open-source tool and a step-by-step tutorial are available at <uri>https://github.com/cda-tum/mmft-ooc-designer</uri>. Evaluations (inspired by real-world use cases and confirmed by computational fluid dynamic simulations as well as a fabrication process) demonstrate the applicability and validity of the proposed approach.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2287-2299"},"PeriodicalIF":2.7,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10771959","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Aware Heterogeneous Federated Learning via Approximate DNN Accelerators","authors":"Kilian Pfeiffer;Konstantinos Balaskas;Kostas Siozios;Jörg Henkel","doi":"10.1109/TCAD.2024.3509793","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3509793","url":null,"abstract":"In Federated Learning (FL), devices that participate in the training usually have heterogeneous resources, such as energy availability. In current deployments of FL, devices that do not fulfill certain hardware requirements are often dropped from the collaborative training. However, dropping devices in FL can degrade training accuracy and introduce bias or unfairness. Several works have tackled this problem on an algorithm level, e.g., by letting constrained devices train a subset of the server neural network (NN) model. However, it has been observed that these techniques are not effective w.r.t. accuracy. Importantly, they make simplistic assumptions about devices’ resources via indirect metrics, such as multiply-accumulate (MAC) operations or peak memory requirements. We observe that memory access costs (that are currently not considered in simplistic metrics) have a significant impact on the energy consumption. In this work, for the first time, we consider on-device accelerator design for FL with heterogeneous devices. We utilize compressed arithmetic formats and approximate computing, aiming to satisfy limited energy budgets. Using a hardware-aware energy model, we observe that, contrary to the state of the art’s moderate energy reduction, our technique allows for lowering the energy requirements (by <inline-formula><tex-math>$4\\times$</tex-math></inline-formula>) while maintaining higher accuracy.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 6","pages":"2054-2066"},"PeriodicalIF":2.7,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}