{"title":"Reducing Transistor Count in CMOS Logic Design Through Clustering and Library-Independent Multiple-Output Logic Synthesis","authors":"Anup Kumar Biswas;Dimitri Kagaris","doi":"10.1109/TCAD.2025.3538492","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3538492","url":null,"abstract":"We propose a novel transistor-level synthesis method to minimize the number of transistors needed to implement a digital circuit. In contrast with traditional standard cell design methods or transistor-level synthesis methods based on single-input “complex” gates or “super” gates, our method considers multioutput clusters as the basic resynthesis unit. Our tool takes any gate-level circuit netlist as input and divides it into several clusters of user-controlled size. For each output of a cluster, a simplified sum of product (SOP) expression is obtained and all such expressions are jointly minimized for the cluster using the MOTO-X multioutput transistor-level synthesis tool. Then, we consider groups of clusters, referred to as “superclusters,” to collectively reduce the overall transistor count. Experimental results indicate average transistor count reductions compared to the ABC synthesis tool of 9.95%, 6.53%, 10.49%, 13.09%, and 9.76% for the ISCAS’85, LGSynth’89, LGSynth’91, EPFL’15 and ITC’99 benchmark suites, respectively. 
Furthermore, our proposed approach proves to be more efficient than the transistor-mapped binary decision diagram approach, highlighting the potential of our methodology for optimizing integrated circuits at the transistor level while delivering enhancements in power efficiency and demonstrating varied improvements in delay performance.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3046-3059"},"PeriodicalIF":2.7,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DynMap: A Heuristic Dynamic Mapper for CGRA Multitask Dynamic Resource Allocation","authors":"Yufei Yang;Chenhao Xie;Liansheng Liu;Xiyuan Peng","doi":"10.1109/TCAD.2025.3537975","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3537975","url":null,"abstract":"Coarse-grained reconfigurable architecture (CGRA) has received increasing attention in both industry and academia due to its comprehensive advantages of performance, energy efficiency, and flexibility. To improve the resource utilization and handle the mixing workloads in the real-world, multiple tasks sharing the whole CGRA has became an important technical trend, and the varying resource requirements throughout their life cycles also makes run-time dynamic resource allocation (DRA) necessary for higher-multitask throughput. As the key stage of DRA, dynamic mapping (DM) is responsible for mapping kernels within each task to the dynamically allocated CGRA resources. However, existing DM methods have difficulty to balance the mapping time and the mapping quality, resulting in a significant gap between the actual and the optimal task throughput. To address the challenge, we propose DynMap, a heuristic dynamic mapper for CGRA multitask DRA. With the support of specialized scheduling and routing schemes, DynMap heuristically references the placement tendency in the static mapping result to dramatically save the mapping time, while maintaining the high-mapping quality by minimizing the possibility of resource conflicts. 
Experimental evaluation demonstrates that DynMap not only achieves an average mapping time of 1.17 ms and an average of 98.33% of the optimal mapping quality on different CGRA architectures, but also reaches an average of 98.85% of the optimal task throughput expected in different CGRA multitask DRA scenarios, making the gap between actual and optimal task throughput on average <inline-formula> <tex-math>$31.75\\times $ </tex-math></inline-formula> smaller than that of current methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2979-2991"},"PeriodicalIF":2.7,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SANA-FE: Simulating Advanced Neuromorphic Architectures for Fast Exploration","authors":"James A. Boyle;Mark Plagge;Suma George Cardwell;Frances S. Chance;Andreas Gerstlauer","doi":"10.1109/TCAD.2025.3537971","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3537971","url":null,"abstract":"Neuromorphic computing is concerned with designing computer architectures inspired by the brain, with recent work focusing on platforms to efficiently execute large spiking neural networks (SNNs). Future designs are expected to improve their capabilities and performance by incorporating novel features, such as emerging neuromorphic devices and analog computation. There is, however, a lack of high-level performance estimation tools to evaluate the impact of such features at the architectural level, to evaluate architectural tradeoffs, and to aid with co-design and design-space exploration. Existing neuromorphic simulators either do not consider hardware performance, only model abstract SNN dynamics or are targeted to a single specific architecture. In this work, we propose SANA-FE, a novel simulator that can rapidly and accurately estimate performance and energy efficiency of different SNN-based designs. Our simulator uses a general and configurable architecture description format that can specify a wide range of neuromorphic designs. Using such an architecture description, SANA-FE simulates system activity when executing a given spiking application at an abstract time-step granularity, and it uses activity counts and per-activity performance metrics to estimate energy and latency for each time-step. We further show a calibration methodology and apply it to model performance of Intel’s Loihi platform. Results demonstrate that our simulator can predict Loihi’s energy and latency for three real-world applications, within 12% and 25%, respectively. 
We further model IBM’s TrueNorth architecture, simulating a random network over <inline-formula> <tex-math>$20\\times $ </tex-math></inline-formula> faster than existing discrete-event-based TrueNorth simulators. Finally, we demonstrate SANA-FE’s design-space exploration capabilities by optimizing a Loihi baseline architecture for two applications, reducing run-time by 21% while increasing dynamic energy usage by only 2%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3165-3178"},"PeriodicalIF":2.7,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144663731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Varying Periods of In-Field Testing With Storage- and Counter-Based Logic Built-In Self-Test","authors":"Irith Pomeranz","doi":"10.1109/TCAD.2025.3536384","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3536384","url":null,"abstract":"In-field testing is important for detecting defects that escaped manufacturing tests or occurred during the lifetime of a chip. When in-field testing is performed periodically, some of the test periods may be shorter than others. Short test periods should focus on the faults that are the most likely to occur with aging, whereas long test periods can apply a more comprehensive test set. This article studies this scenario in the context of a logic built-in self-test (LBIST) approach that partitions compressed tests into subvectors for on-chip storage, and combines subvectors into compressed tests on-chip using counters. This approach has low storage requirements, allows complete fault coverage to be achieved, and uses a moderate number of tests. The problem of applying a small number of tests during a short testing period is formulated as a static problem of rearranging the subvectors (with possible repetitions and modification) such that the first <inline-formula> <tex-math>$n_{1}$ </tex-math></inline-formula> subvectors are sufficient for detecting a subset of faults <inline-formula> <tex-math>$F_{1}$ </tex-math></inline-formula>, and <inline-formula> <tex-math>$n_{1}$ </tex-math></inline-formula> is as small as possible. 
Experimental results for benchmark circuits in an academic environment demonstrate the number of tests and overall storage requirements.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3241-3245"},"PeriodicalIF":2.7,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144663719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compact Geometric Feature Representation for Improved Capacitance Pattern-Matching in Parasitic Extraction","authors":"Ping Li;Zhong Guan","doi":"10.1109/TCAD.2025.3536380","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3536380","url":null,"abstract":"The runtime and accuracy of interconnect parasitic extraction are becoming increasingly crucial for integrated circuit design in advanced manufacturing processes. In this study, we propose a novel method of capacitance matching that maps low-level features to high-level spaces, which reduces feature dimensions without losing essential information and provides a compact form for the geometric features of 2-D patterns in full-chip capacitance extraction. Furthermore, we are introducing a creative labeling strategy that eliminates the requirement for separate task-specific heads or different input representations. This innovative approach enables simultaneous data processing for both total and coupling capacitance tasks, leading to a significant reduction of complexities. Our experiments demonstrate that our entire feature representation and pattern-matching algorithm delivers exceptional accuracy, improved runtime, providing an efficient solution for large-scale capacitance extraction.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3085-3098"},"PeriodicalIF":2.7,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Highly Reliable Dual-Mode RRAM PUF With Key Concealment Scheme","authors":"Jiang Li;Yijun Cui;Chongyan Gu;Chenghua Wang;Weiqiang Liu;Shahar Kvatinsky","doi":"10.1109/TCAD.2025.3536376","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3536376","url":null,"abstract":"Physical unclonable function (PUF) has been widely used in the Internet of Things (IoT) as a promising hardware security primitive. In recent years, PUFs based on resistive random access memory (RRAM) have demonstrated excellent reliability and integration density. Most previous designs store PUF keys directly in RRAMs, increasing vulnerability to attacks. This article proposes a dual-mode RRAM PUF, named differential mode and flexible mode, utilizing the difference in switching capability between RRAMs during parallel SET operations as the entropy source. The proposed PUF can reliably reproduce keys between cycles, so a key concealment scheme is used to protect PUF keys from being continuously exposed, improving the security of the RRAM PUF. The proposed RRAM PUF exhibits high reliability over ±10% VDD and a wide temperature range from −25°C to 125°C through post-processing operations. The flexible mode can generate a significant number of keys for high-security applications. Since the PUF keys can be concealed, the proposed PUF is compatible with in-memory computing. 
It can be implemented using the same RRAM array as experimentally validated using a MAGIC operation, thus reducing the hardware overhead.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2870-2882"},"PeriodicalIF":2.7,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GIRD: A Green IR-Drop Estimation Method","authors":"Chee-An Yu;Yu-Tung Liu;Yu-Hao Cheng;Shao-Yu Wu;Hung-Ming Chen;C.-C. Jay Kuo","doi":"10.1109/TCAD.2025.3534118","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3534118","url":null,"abstract":"An energy-efficient high-performance static IR-drop estimation method based on green learning called Green IR Drop (GIRD) is proposed in this work. GIRD processes the IC design input in three steps. First, the input netlist data are converted to multichannel maps. Their joint spatial–spectral representations are determined with PixelHop. Next, discriminant features are selected using the relevant feature test (RFT). Finally, the selected features are fed to the eXtreme Gradient Boosting trees regressor. Both PixelHop and RFT are green learning tools. GIRD yields a low carbon footprint due to its smaller model sizes and lower computational complexity. Besides, its performance scales well with small training datasets. Experiments on synthetic and real circuits are given to demonstrate the superior performance of GIRD. 
The model size and the computational complexity of GIRD, the latter measured in floating-point operations (FLOPs), are only <inline-formula> <tex-math>$10^{-3}$ </tex-math></inline-formula> and <inline-formula> <tex-math>$10^{-2}$ </tex-math></inline-formula> of those of deep-learning methods, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3073-3084"},"PeriodicalIF":2.7,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CMCache: An Adaptive Cross-Level Data Placement Method for Multilevel Cache","authors":"Zhaoyang Zeng;Yujuan Tan;Zhulin Ma;Jiali Li;Sanle Zhao;Duo Liu;Xianzhang Chen;Ao Ren","doi":"10.1109/TCAD.2025.3534116","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3534116","url":null,"abstract":"Multilevel cache systems enhance I/O performance by optimizing data placement across various cache levels from a global perspective. However, existing methods often struggle to place data at the optimal cache level promptly due to their reliance on historical access patterns and inflexible placement strategies. These methods face two main challenges: 1) for already cached data with sufficient access history, existing approaches only optimize movement between adjacent cache levels, potentially delaying data arrival at its globally optimal cache level and leading to unnecessary bandwidth consumption and increased latency and 2) for newly entered data without access history, current methods cannot accurately predict their future hotness and simply place them at a fixed cache level (i.e., first or final level), overlooking future accesses of new data and potentially resulting in high cache miss rates or cache pollution. To address these issues, we propose CMCache, an adaptive cross-level data placement method for multilevel cache. CMCache applies distinct placement strategies for cached and new data to reach the optimal level timely, considering their different characteristics. It also logically divides cache space into two sections to manage cached and new data separately, dynamically adjusting section sizes based on access patterns. 
This approach significantly improves data placement efficiency, achieving up to an 89% reduction in miss rates and a 79% decrease in average response times compared to existing methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2911-2924"},"PeriodicalIF":2.7,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MACS: A Multidomain Collaborative Adaptive Clock Scheme for Large-Scale Reconfigurable Dataflow Accelerators","authors":"Shuya Ji;Weidong Yang;Jianfei Jiang;Naifeng Jing;Honglan Jiang;Zhigang Mao;Qin Wang","doi":"10.1109/TCAD.2025.3533305","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3533305","url":null,"abstract":"To guarantee reliability and correctness, VLSI circuits are designed with conservative margins to maintain timing and power integrity against process, voltage, and temperature (PVT) variations across diverse workloads. However, worst-case PVT and workload conditions rarely occur in practice, resulting in significant timing slack and hence performance and energy loss, especially in reconfigurable dataflow accelerator RDA due to their large-scale and configurable features. Previous studies have attempted to exploit workload or PVT slack, yet achieving limited benefits for reconfigurable dataflow accelerator (RDAs) with large-scale processing element PE arrays. The key issues come from restricted scaling ranges for the clock, insufficient representations for the workload, and unbalanced workloads within processing elementss (PEs). To address these challenges, this article proposes the first multidomain collaborative adaptive clock scheme (MACS) to efficiently exploit both the workload and PVT timing slack for large-scale reconfigurable dataflow acceleratorss (RDAs). MACS partitions the RDA into several clock domains and allows constrained clock domain crossing, which enhances the hardware efficiency with minimal overhead and supports timing validation using conventional static timing analysis (STA) tools. In each domain, an operand-aware workload detection unit is developed, using both static configurations and dynamic operands to assess workload. The detected workload, combined with the monitored PVT conditions, determines the subsequent clock period. 
Additionally, to enable the exploration of timing slack over a broader range, the period range of the adaptive clock is extended. Experimental results show that MACS achieves a performance improvement of 76.3% or an energy saving of 36.6% with a hardware cost of 3.5%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2992-3005"},"PeriodicalIF":2.7,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zram Instance Pool Framework for Adaptive Memory Compression in Resource-Sensitive Embedded Operating Systems","authors":"Yin Deng;Guoqi Xie;Chenglai Xiong;Sirong Zhao;Wei Ren;Kenli Li","doi":"10.1109/TCAD.2025.3533300","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3533300","url":null,"abstract":"Memory compression can reduce the size of the inactive data in the random access memory (RAM), thereby freeing up unused space and allowing more programs to run; however, current mainstream memory compression frameworks (e.g., Zram and Zswap) and algorithms (e.g., Zstd and Lz4) do not effectively solve the problem of increased CPU utilization, causing they cannot be directly applied to the resource-sensitive embedded operating system, that is, sensitive to both CPU utilization and memory usage. In this study, we develop a Zram instance pool framework called ZramPool for adaptive memory compression. The framework consists of the swap space with multiple Zram instances and the adaptive Zram compression module. Through introducing linear regression analysis, the number of Zram instances can be adaptively adjusted based on the size of the compressed data, allowing Zram instances to work in parallel to match the workload. In ZramPool, we achieve two different requirements of reducing CPU utilization while keeping compression speed and increasing compression speed while keeping CPU utilization. ZramPool is deployed in the embedded Linux OS with a 8GB memory size running on the ARMv8 architecture. For the first requirement, ZramPool can reduce CPU utilization by an average of 11.42% while the compression speed only decreases by an average of 2.4%. 
For the second requirement, ZramPool can increase compression speed by an average of 11.71% while the CPU utilization only increases by an average of 1.9%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2925-2938"},"PeriodicalIF":2.7,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}