{"title":"RaPC: Raw Bit Error Rate Aware Polar Coding for 3-D NAND Flash Memory","authors":"Ruifeng Tu;Meng Zhang;Changsheng Xie;Fei Wu","doi":"10.1109/TCAD.2025.3540375","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3540375","url":null,"abstract":"Reliability challenges such as random telegraph noise (RTN) and intercell electrostatic interference have worsened as feature sizes in planar NAND flash memory continue to shrink. To improve storage capacity, 3-D stacking of NAND flash memory has emerged as the preferred development path. However, the switch to 3-D NAND flash brings additional challenges, such as shorter lifespans and lower reliability as a result of higher integration densities and intricate vertical interference. This article proposes RaPC, a raw bit error rate (RBER) aware polar coding scheme for improving the data reliability of 3-D NAND flash memory. The error correction capability of the polar code is adjusted dynamically as the RBER varies, which ensures reliability and reduces decoding delay. 
Simulation results demonstrate that RaPC offers significant advantages in decoding latency and performance over conventional low-density parity-check (LDPC) codes within specific RBER ranges, making it a promising solution for enhancing the reliability of 3-D NAND flash memory.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 9","pages":"3546-3559"},"PeriodicalIF":2.9,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temperature Effects of Program Operation in 3-D NAND Flash Memory: Observations, Analysis, and Solutions","authors":"Hua Feng;Debao Wei;Qi Wang;Yongchao Wang;Liyan Qiao;Zongliang Huo","doi":"10.1109/TCAD.2025.3539982","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3539982","url":null,"abstract":"As its storage density continues to increase, flash memory has become the mainstream storage medium for electronic devices. Writing data in a low-temperature environment distorts the flash memory threshold voltage distribution (TVD), which causes the raw bit error rate to spike and ultimately degrades the performance of flash-based electronic devices. To ameliorate the reliability problem caused by flash memory read and program temperature variations, this study proposes a flash memory programming temperature compensation algorithm based on read reference voltage (PTC-RRV) calibration. 3-D triple-level cell (TLC) flash memory is currently the mainstream storage medium for consumer electronics. Based on a large number of tests on real chips of this type, the relationship between the programming/reading temperature and the TVD of flash memory is fully characterized, and a programming temperature compensation model is constructed. 
The model evaluation results show that the PTC-RRV strategy can significantly reduce the average number of read retries for data written at low temperature and effectively improve the storage reliability and read performance of flash memory, and that its optimization effect on electronic devices is better than that of existing temperature compensation algorithms.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 9","pages":"3313-3322"},"PeriodicalIF":2.9,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Re-CATA: Real-Time and Flexible Accelerator Design Framework for On-Device Codec Avatars","authors":"Yongan Zhang;Yuecheng Li;Syed Shakib Sarwar;H. Ekin Sumbul;Yonggan Fu;Haoran You;Cheng Wan;Yingyan Lin","doi":"10.1109/TCAD.2025.3539600","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3539600","url":null,"abstract":"Real-time Codec Avatars, which employ deep generative models for 3-D reconstruction of human features, are crucial for immersive telepresence in augmented reality and virtual reality (AR/VR) environments. However, deploying these avatars in real time on AR/VR headsets is challenging because existing devices cannot achieve satisfactory performance within stringent hardware resource constraints. To address these challenges, we introduce Re-CATA, an innovative full-stack and flexible Codec Avatar accelerator design framework. Re-CATA is designed to deliver real-time throughput (greater than 120 FPS) for the complete Codec Avatar processing pipeline within an edge-level power budget of 5 W under FPGA prototyping. Our approach begins by abstracting the operation mapping and scheduling challenges inherent in Codec Avatars, which require both centralized and distributed processing to handle dynamically changing workloads. We propose a novel hardware resource and workload partitioning scheme optimized for these fluctuating demands. To complement this, we introduce an agile runtime scheduling system for efficient workload reallocation among computing units as needed, recognizing the limitations of static partitioning in rapidly evolving workload scenarios. Furthermore, our micro-architecture design incorporates unified computing modules and efficient hardware peripherals, enabling seamless workload balancing across the Codec Avatar processing pipeline. We evaluate the Re-CATA accelerators via on-board FPGA prototyping, comparing them to various baselines, including commercial AR/VR system-on-chips and academic accelerators. 
This evaluation demonstrates a speedup of up to $5.95\\times$ under similar settings.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3020-3033"},"PeriodicalIF":2.7,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-Pass: An Operation Unit-Based In-Memory Computing Architecture for Sparse Neural Networks","authors":"Shang Wang;Qi Cao;Yongqiang Wang;Hang Chen;Zhenjiao Chen;Feng Liang","doi":"10.1109/TCAD.2025.3539592","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3539592","url":null,"abstract":"Compute-in-memory (CIM) has emerged as a prominent research focus in recent years, offering a promising alternative for advancing traditional von Neumann architecture computers. However, the extensive array structures and peripheral circuits inherent in CIM introduce challenges related to latency and power consumption. The operation unit (OU) has gained attention as a practical solution to these issues, significantly transforming the computational paradigm of in-memory computing. Despite its potential, the possibilities enabled by this approach remain underexplored. This article presents a novel architecture, single-pass, designed around OU implementation with a new OU partitioning method optimized for sparse networks. Additionally, we propose a matrix compression technique leveraging a dual heuristic greedy algorithm (DHGA), forming the foundation of our architecture-specific mapping strategy. 
Experimental results demonstrate that, within given area constraints, our architecture achieves an average energy efficiency improvement of 29.8% and a speedup of 82.3% across various networks compared to the baseline.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2952-2965"},"PeriodicalIF":2.7,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shallow Quantum Circuit Implementation of Symmetric Functions With Limited Ancillary Qubits","authors":"Wei Zi;Junhong Nie;Xiaoming Sun","doi":"10.1109/TCAD.2025.3539002","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3539002","url":null,"abstract":"Optimizing the depth and number of ancillary qubits in quantum circuits is crucial in quantum computation, given the limitations imposed by current quantum devices. In this article, we introduce an innovative approach for implementing arbitrary symmetric Boolean functions using poly-logarithmic depth quantum circuits with only a logarithmic number of ancillary qubits. Symmetric functions are those whose outputs are dictated solely by the Hamming weight of the inputs. These functions find applications across various domains, including quantum machine learning and arithmetic circuit synthesis. Moreover, by fully leveraging the potential of qutrits, the ancilla count can be further reduced to just one. The key technique involves a novel poly-logarithmic depth quantum circuit designed to compute Hamming weight without the need for ancillary qubits. This quantum circuit for Hamming weight is of independent interest due to its wide-ranging applications, such as in quantum memory, quantum machine learning, and Hamiltonian dynamics simulations.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3060-3072"},"PeriodicalIF":2.7,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Device Design Guidelines to Boost Up AC Performance of CFET (Complementary Field-Effect-Transistor)-Based Inverter","authors":"Jaehyuk Lim;Donghwan Han;Juho Sung;Seokchan Yoon;Sanghyun Kang;Gwon Kim;Hyoung Won Baac;Changhwan Shin","doi":"10.1109/TCAD.2025.3539599","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3539599","url":null,"abstract":"Complementary field-effect transistors (CFETs) have emerged as promising candidates for next-generation semiconductor devices. CFETs feature a structure with an nMOS (or pMOS) transistor at the bottom and a transistor of the opposite type at the top. CFETs can be classified into Fin-CFETs or GAA-CFETs based on their channel structure. In this study, we compare and analyze these two devices to determine which structure is more favorable for device scaling and which device exhibits better performance per unit area. For a reliable analysis, the threshold voltage was adjusted to be the same for all devices. Initially, to compare the DC performance, the on-state drive currents in both linear mode and saturation mode operations were extracted from the $I_{\\mathrm{DS}}$-versus-$V_{\\mathrm{GS}}$ input-transfer characteristics and compared. Subsequently, complementary metal-oxide-semiconductor inverters were constructed to compare their AC performance. 
Six parameters were extracted and compared: high-to-low propagation delay ($t_{pHL}$), fall time ($t_{f}$), low-to-high propagation delay ($t_{pLH}$), rise time ($t_{r}$), overshoot voltage ($V_{ov}$), and undershoot voltage ($V_{und}$). Based on the results, we suggest which CFET structure is more suitable for device scaling.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3189-3196"},"PeriodicalIF":2.7,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144663730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RoboSpike: Fully Utilizing the Heterogeneous System With Subcallback Scheduling in ROS 2","authors":"Hongyi Li;Qingyuan Yang;Songchen Ma;Rong Zhao;Xinglong Ji","doi":"10.1109/TCAD.2025.3538615","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3538615","url":null,"abstract":"The advancement in artificial intelligence (AI) has greatly propelled the development of robotics, requiring the adoption of heterogeneous computing architectures with multicore CPUs, GPUs, and accelerators to meet the growing computational needs of edge computing. Such heterogeneity, coupled with the inherently I/O-intensive nature of robotic applications, poses substantial challenges for task scheduling and resource management. These challenges are particularly acute for systems striving to maximize computational resource utilization, and they cannot be effectively addressed through callback-level scheduling. To overcome these obstacles, we developed RoboSpike, a systematic solution built on the Robot Operating System 2 (ROS 2). We first implemented a subcallback scheduling mechanism that uses coroutines to put to work the CPUs that would otherwise be blocked waiting for I/O operations. Building on this mechanism, we extended the design to incorporate the coprocessor and introduced an auto-tuning algorithm to adapt to system performance variations. Finally, we performed a response time analysis to ensure that RoboSpike is predictable in time. The evaluation results demonstrate that RoboSpike achieves substantial improvements, increasing throughput by 1.65–2.25 times in real-world scenarios. 
RoboSpike enhances the scheduling capabilities of ROS 2 by refining the scheduling granularity below the callback level, thus opening up new opportunities for performance improvement in robotic systems, especially in resource-limited scenarios with complex workloads.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2897-2910"},"PeriodicalIF":2.7,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coexisting Hyperchaos in a Memristive Neuromorphic Oscillator","authors":"Xin Zhang;Chunbiao Li;Tengfei Lei;Herbert Ho-Ching Iu;Tomasz Kapitaniak","doi":"10.1109/TCAD.2025.3538692","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3538692","url":null,"abstract":"Memristors have been widely integrated into neurons as the bridge for introducing external magnetic induction currents. The complex oscillation induced by external magnetic stimulation is a hot topic in neuron dynamics. When a memristor is introduced into the Hindmarsh-Rose (HR) neuron to simulate the external magnetic field, a novel memristive neuromorphic hyperchaotic oscillator is constructed. The memristor weight can trigger complex neuronal firing dynamics, including the rare hyperchaotic bursting. Furthermore, when the technique of offset boosting-oriented attractor doubling is employed, a double-scroll hyperchaotic attractor can be generated, which can split into three independent coexisting attractors under some specific offsets. More interestingly, two symmetric periodic attractors and two symmetric hyperchaotic attractors can coexist under certain conditions. In this work, a neuron with coexisting hyperchaotic attractors is constructed and exhaustively explored, which provides a good candidate for constituting large-scale brain-like neuromorphic oscillators. 
A PCB-based hardware circuit produces the oscillations validating the numerical simulations and theoretical analyses.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3179-3188"},"PeriodicalIF":2.7,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144663729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing Transistor Count in CMOS Logic Design Through Clustering and Library-Independent Multiple-Output Logic Synthesis","authors":"Anup Kumar Biswas;Dimitri Kagaris","doi":"10.1109/TCAD.2025.3538492","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3538492","url":null,"abstract":"We propose a novel transistor-level synthesis method to minimize the number of transistors needed to implement a digital circuit. In contrast with traditional standard cell design methods or transistor-level synthesis methods based on single-input “complex” gates or “super” gates, our method considers multioutput clusters as the basic resynthesis unit. Our tool takes any gate-level circuit netlist as input and divides it into several clusters of user-controlled size. For each output of a cluster, a simplified sum of product (SOP) expression is obtained and all such expressions are jointly minimized for the cluster using the MOTO-X multioutput transistor-level synthesis tool. Then, we consider groups of clusters, referred to as “superclusters,” to collectively reduce the overall transistor count. Experimental results indicate average transistor count reductions compared to the ABC synthesis tool of 9.95%, 6.53%, 10.49%, 13.09%, and 9.76% for the ISCAS’85, LGSynth’89, LGSynth’91, EPFL’15 and ITC’99 benchmark suites, respectively. 
Furthermore, our proposed approach proves to be more efficient than the transistor-mapped binary decision diagram approach, highlighting the potential of our methodology for optimizing integrated circuits at the transistor-level while delivering enhancements in power efficiency and demonstrating varied improvements in delay performance.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"3046-3059"},"PeriodicalIF":2.7,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DynMap: A Heuristic Dynamic Mapper for CGRA Multitask Dynamic Resource Allocation","authors":"Yufei Yang;Chenhao Xie;Liansheng Liu;Xiyuan Peng","doi":"10.1109/TCAD.2025.3537975","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3537975","url":null,"abstract":"Coarse-grained reconfigurable architecture (CGRA) has received increasing attention in both industry and academia due to its comprehensive advantages of performance, energy efficiency, and flexibility. To improve resource utilization and handle the mixed workloads of the real world, sharing the whole CGRA among multiple tasks has become an important technical trend, and the varying resource requirements of tasks throughout their life cycles also make run-time dynamic resource allocation (DRA) necessary for higher multitask throughput. As the key stage of DRA, dynamic mapping (DM) is responsible for mapping kernels within each task to the dynamically allocated CGRA resources. However, existing DM methods have difficulty balancing mapping time and mapping quality, resulting in a significant gap between the actual and the optimal task throughput. To address this challenge, we propose DynMap, a heuristic dynamic mapper for CGRA multitask DRA. With the support of specialized scheduling and routing schemes, DynMap heuristically references the placement tendency in the static mapping result to dramatically reduce the mapping time, while maintaining high mapping quality by minimizing the possibility of resource conflicts. 
Experimental evaluation demonstrates that DynMap not only achieves an average mapping time of 1.17 ms and an average of 98.33% of the optimal mapping quality on different CGRA architectures, but also reaches an average of 98.85% of the optimal task throughput expected in different CGRA multitask DRA scenarios, making the gap between actual and optimal task throughput on average $31.75\\times$ smaller than that of current methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"2979-2991"},"PeriodicalIF":2.7,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}