{"title":"IDWA: A Importance-Driven Weight Allocation Algorithm for Low Write–Verify Ratio RRAM-Based In-Memory Computing","authors":"Jingyuan Qu;Debao Wei;Dejun Zhang;Yanlong Zeng;Zhelong Piao;Liyan Qiao","doi":"10.1109/TVLSI.2025.3578388","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578388","url":null,"abstract":"Resistive random access memory (RRAM)-based in-memory computing (IMC) architectures are currently receiving widespread attention. Since this computing approach relies on the analog characteristics of the devices, the write variation of RRAM can affect the computational accuracy to varying degrees. Conventional write–verify (W&V) procedures are performed on all weight parameters, resulting in significant time overhead. To address this issue, we propose a training algorithm that can recover the offline IMC accuracy impacted by write variation with a lower cost of W&V overhead. We introduce a importance-driven weight allocation (IDWA) algorithm during the training process of the neural network. This algorithm constrains the values of less important weights to suppress the diffusion of variation interference on this part of the weights, thus reducing unnecessary accuracy degradation. Additionally, we employ a layer-wise optimization algorithm to identify important weights in the neural network for W&V operations. Extensive testing across various deep neural networks (DNNs) architectures and datasets demonstrates that our proposed selective W&V methodology consistently outperforms current state-of-the-art selective W&V techniques in both accuracy preservation and computational efficiency. At same accuracy levels, it delivers a speed improvement of <inline-formula> <tex-math>$6times sim 32times $ </tex-math></inline-formula> compared to other advanced methods.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2508-2517"},"PeriodicalIF":3.1,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Unstructured Sparse DNNs via Multilevel Partial Sum Reduction and PE Array-Level Load Balancing","authors":"Chendong Xia;Qiang Li;Zhi Li;Bing Li;Huidong Zhao;Shushan Qiao","doi":"10.1109/TVLSI.2025.3577626","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3577626","url":null,"abstract":"Unstructured pruning introduces significant sparsity in deep neural networks (DNNs), enhancing accelerator hardware efficiency. However, three critical challenges constrain performance gains: 1) complex fetching logic for nonzero (NZ) data pairs; 2) load imbalance across processing elements (PEs); and 3) PE stalls from write-back contention. This brief proposes an energy-efficient accelerator addressing these inefficiencies through three innovations. First, we propose a Cartesian-product output-row-stationary (CPORS) dataflow that inherently matches NZ data pairs by sequentially fetching compressed data. Second, a multilevel partial sum reduction (MLPR) strategy minimizes write-back traffic and converts random PE stalls into manageable load imbalance. Third, a kernel sorting and load scheduling (KSLS) mechanism resolves PE idle/stall and achieves PE array-level load balancing, attaining 76.6% average PE utilization across all sparsity levels. Implemented in 22-nm CMOS, the accelerator delivers <inline-formula> <tex-math>$1.85times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$1.4times $ </tex-math></inline-formula> energy efficiency over baseline and achieves 25.8 TOPS/W peak energy efficiency at 90% sparsity.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2329-2333"},"PeriodicalIF":2.8,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 28 nm Dual-Mode SRAM-CIM Macro With Local Computing Cell for CNNs and Grayscale Edge Detection","authors":"Chunyu Peng;Xiaohang Chen;Mengya Gao;Jiating Guo;Lijun Guan;Chenghu Dai;Zhiting Lin;Xiulong Wu","doi":"10.1109/TVLSI.2025.3578319","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578319","url":null,"abstract":"With the rise of artificial intelligence (AI), neural network applications are growing in demand for efficient data transmission. The traditional von Neumann architecture can no longer keep pace with modern technological needs. Computing-in-memory (CIM) is proposed as a promising solution to address this bottleneck. This work introduces a local computing cell (LCC) scheme based on compact 6T-SRAM cells. The proposed circuit aims to enhance energy efficiency and reduce power consumption by reusing the LCC. The LCC circuit can perform the multiplication of a 2-bit input with a 1-bit weight, which can be applied to convolutional neural networks (CNNs) with the multiply-accumulate (MAC) operations. Through circuit reuse, it can also be used for multibit multiply operations, performing 2-bit input multiplication and 1-bit weight addition, which can be applied to grayscale edge detection in images. The energy efficiency of the SRAM-CIM macro achieves an energy efficiency of 46.3 TOPS/W under MAC operations with input precision of 8-bits and weight precision of 8-bits, and up to 389.1–529.1 TOPS/W under the calculation in one subarray with an input precision of 2-bits and a weight precision of 1-bit. The estimated inference accuracy on CIFAR-10 datasets is 90.21%.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2264-2273"},"PeriodicalIF":2.8,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 10-bit 50-MS/s Radiation Tolerant Split Coarse/Fine SAR ADC in 65-nm CMOS","authors":"Ming Yan;Jaime Cardenas Chavez;Kamal El-Sankary;Li Chen;Xiaotong Lu","doi":"10.1109/TVLSI.2025.3576998","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576998","url":null,"abstract":"This article presents a 10-bit radiation-hardened-by-design (RHBD) SAR analog-to-digital converter (ADC) operating at 50 MS/s, designed for aerospace applications in high-radiation environments. The system- and circuit-level redundancy techniques are implemented to mitigate radiation-induced errors and metastability. A novel split coarse/fine asynchronous SAR ADC architecture is proposed to provide system-level redundancy. At circuits level, single-event effects (SEEs) error detection and radiation-hardened techniques are implemented. Our co-designed SEE error detection scheme includes last-bit-cycle (LBC) detection following the LSB cycle and metastability detection (MD) via a ramp generator with a threshold trigger. This approach detects and corrects radiation-induced errors using a coarse/fine redundant algorithm. The radiation-hardened latch comparators and D flip-flops (DFFs) are incorporated to further mitigate SEEs. The prototype design is fabricated using TSMC 65-nm technology, with an ADC core area of 0.0875 mm<sup>2</sup> and a power consumption of 2.79 mW at a 1.2-V power supply. Postirradiation tests confirm functionality up to 100-krad(Si) total ionizing dose (TID) and demonstrate over 90% suppression of large SEE under laser testing.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2132-2142"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Density Energy-Efficient CNM Macro Using Hybrid RRAM and SRAM for Memory-Bound Applications","authors":"Jun Wang;Shengzhe Yan;Xiangqu Fu;Zhihang Qian;Zhi Li;Zeyu Guo;Zhuoyu Dai;Zhaori Cong;Chunmeng Dou;Feng Zhang;Jinshan Yue;Dashan Shang","doi":"10.1109/TVLSI.2025.3576889","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576889","url":null,"abstract":"The big data era has facilitated various memory-centric algorithms, such as the Transformer decoder, neural network, stochastic computing (SC), and genetic sequence matching, which impose high demands on memory capacity, bandwidth, and access power consumption. The emerging nonvolatile memory devices and compute-near-memory (CNM) architecture offer a promising solution for memory-bound tasks. This work proposes a hybrid resistive random access memory (RRAM) and static random access memory (SRAM) CNM architecture. The main contributions include: 1) proposing an energy-efficient and high-density CNM architecture based on the hybrid integration of RRAM and SRAM arrays; 2) designing low-power CNM circuits using the logic gates and dynamic-logic adder with configurable datapath; and 3) proposing a broadcast mechanism with output-stationary workflow to reduce memory access. The proposed RRAM-SRAM CNM architecture and dataflow tailored for four distinct applications are evaluated at a 28-nm technology, achieving 4.62-TOPS<inline-formula> <tex-math>$/$ </tex-math></inline-formula>W energy efficiency and 1.20-Mb<inline-formula> <tex-math>$/$ </tex-math></inline-formula>mm<sup>2</sup> memory density, which shows <inline-formula> <tex-math>$11.35times $ </tex-math></inline-formula>–<inline-formula> <tex-math>$25.81times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.44times $ </tex-math></inline-formula>–<inline-formula> <tex-math>$4.92times $ </tex-math></inline-formula> improvement compared to previous works, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2339-2343"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Design Space Exploration for the BOOM Using SAC-Based Reinforcement Learning","authors":"Mingjun Cheng;Shihan Zhang;Xin Zheng;Xian Lin;Huaien Gao;Shuting Cai;Xiaoming Xiong;Bei Yu","doi":"10.1109/TVLSI.2025.3572799","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3572799","url":null,"abstract":"Design space exploration (DSE) is crucial for optimizing the performance, power, and area (PPA) of CPU microarchitectures (<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-archs). While various machine learning (ML) algorithms have been applied to the <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE problem, the potential of reinforcement learning (RL) remains underexplored. In this article, we propose a novel RL-based approach to address the reduced instruction set computer V (RISC-V) CPU <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE problem. This approach enables dynamic selection and optimization of <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch parameters without relying on predefined modification sequences, thus significantly enhancing exploration flexibility. To address the challenges posed by high-dimensional action spaces and sparse rewards, we use a discrete soft actor-critic (SAC) framework with entropy maximization to promote efficient exploration. In addition, we integrate multistep temporal-difference (TD) learning, an experience replay (ER) buffer, and return normalization to improve sample efficiency and learning stability during training. Our method further aligns optimization with user-defined preferences by normalizing PPA metrics relative to baseline designs. Experimental results on the Berkeley out-of-order machine (BOOM) demonstrate that the proposed approach achieves superior performance compared with state-of-the-art methods, showcasing its effectiveness and efficiency for <inline-formula> <tex-math>$mu $ </tex-math></inline-formula>-arch DSE. Our code is available at <uri>https://github.com/exhaust-create/SAC-DSE</uri>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2252-2263"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Test Primitives: The Unified Notation for Characterizing March Test Sequences","authors":"Ruiqi Zhu;Houjun Wang;Susong Yang;Weikun Xie;Yindong Xiao","doi":"10.1109/TVLSI.2025.3577448","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3577448","url":null,"abstract":"March algorithms are essential for detecting functional memory faults, characterized by their linear complexity and adaptability to emerging technologies. However, the increasing complexity of fault types presents significant challenges to existing fault detection models regarding analytical efficiency and adaptability. This article introduces the test primitive (TP), a unified notation that characterizes March test sequences through a novel methodology that decouples fault detection operations from sensitization states. The proposed TP achieves platform independence and seamless integration of fault models, supported by rigorous theoretical proofs. These proofs establish the fundamental properties of the TP in terms of completeness, uniqueness, and conciseness, providing a theoretical foundation that ensures the decoupling method reduces the computational complexity of March algorithm analysis to <inline-formula> <tex-math>$O(1)$ </tex-math></inline-formula>. This reduction is analogous to Karnaugh map simplification in digital logic while enabling millisecond-level automated analysis. Experimental results demonstrate that the proposed method significantly enhances both analyzable fault coverage (FC) and detection accuracy, thereby addressing critical limitations of existing fault detection models.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2542-2555"},"PeriodicalIF":3.1,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 66-Gb/s/5.5-W RISC-V Many-Core Cluster for 5G+ Software-Defined Radio Uplinks","authors":"Marco Bertuletti;Yichao Zhang;Alessandro Vanelli-Coralli;Luca Benini","doi":"10.1109/TVLSI.2025.3576855","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576855","url":null,"abstract":"Following the scale-up of new radio (NR) complexity in 5G and beyond, the physical layer’s computing load on base stations is increasing under a strictly constrained latency and power budget; base stations must process <inline-formula> <tex-math>$gt$ </tex-math></inline-formula> 20-Gb/s uplink wireless data rate on the fly, in <inline-formula> <tex-math>$lt$ </tex-math></inline-formula> 10 W. At the same time, the programmability and reconfigurability of base station components are the key requirements; it reduces the time and cost of new networks’ deployment, it lowers the acceptance threshold for industry players to enter the market, and it ensures return on investments in a fast-paced evolution of standards. In this article, we present the design of a many-core cluster for 5G and beyond base station processing. Our design features 1024, streamlined RISC-V cores with domain-specific FP extensions, and 4-MiB shared memory. It provides the necessary computational capabilities for software-defined processing of the lower physical layer of 5G physical uplink shared channel (PUSCH), satisfying high-end throughput requirements (66 Gb/s for a transition time interval (TTI), 9.4–302 Gb/s depending on the processing stage). The throughput metrics for the implemented functions are ten times higher than in state-of-the-art (SoTA) application-specific instruction processors (ASIPs). The energy efficiency on key NR kernels (2–41 Gb/s/W), measured at 800 MHz, <inline-formula> <tex-math>${25}~^{circ } $ </tex-math></inline-formula>C, and 0.8 V, on a placed and routed instance in 12-nm CMOS technology, is competitive with SoTA architectures. The PUSCH processing runs end-to-end on a single cluster in 1.7 ms, at <6-W average power consumption, achieving 12 Gb/s/W.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2225-2238"},"PeriodicalIF":2.8,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SC-IMC: Algorithm-Architecture Co-Optimized SRAM-Based In-Memory Computing for Sine/Cosine and Convolutional Acceleration","authors":"Qi Cao;Shang Wang;Haisheng Fu;Qifan Gao;Zhenjiao Chen;Li Gao;Feng Liang","doi":"10.1109/TVLSI.2025.3573753","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3573753","url":null,"abstract":"Sine/cosine (SC) is widely used in practical engineering applications, such as image compression and motor control. Nevertheless, due to power sensitivity and speed demands, SC acceleration suffers from limitations in traditional von-Neumann architectures. To overcome this challenge, we propose accelerating SC and convolution using a static random access memory (SRAM)-based in-memory computing (IMC) architecture through an algorithm-architecture co-optimization manner. We develop the first SC algorithm that transforms nonlinear operations into the IMC paradigm, enabling IMC array to handle both SC and artificial intelligence (AI) tasks and making the IMC array a reusable module. Our architecture extends computing functions of macro dedicated to convolutional neural networks (CNNs), with less than a 1% area increase. The proposed SC algorithm for FP32 data achieves high accuracy within 1 unit in the least significant place (ulp) error margin compared with <italic>C</i> math library. Moreover, we build an intelligent IMC system that supports various CNNs. Our IMC macro implements 512-kb binary weight storage within 3.0366-mm<sup>2</sup> area in SMIC 28-nm technology and presents area/energy efficiency of 2160.29–270.04 GOPS/mm<sup>2</sup> and 513.95–8.03 TOPS/W in CNN mode. The proposed algorithm and architecture facilitate the integration of more nonlinear functions into IMC with minimal area overhead.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2200-2213"},"PeriodicalIF":2.8,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fourth-Order Tunable Bandwidth Gm-C Filter for ECG Detection Achieving −7.9 dBV IIP3 Under a 0.5 V Supply","authors":"Farzan Rezaei;Loai G. Salem","doi":"10.1109/TVLSI.2025.3576360","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3576360","url":null,"abstract":"This article introduces a fourth-order <inline-formula> <tex-math>$G_{m}$ </tex-math></inline-formula>-C low-pass filter for ECG detection that achieves high linearity despite operating under a 0.5 V supply by 1) placing the differential pairs (DPs) of the employed <inline-formula> <tex-math>$G_{m}$ </tex-math></inline-formula> stages in a two-loop feedback structure, 2) employing body-driven rather than gate-driven <inline-formula> <tex-math>$G_{m}$ </tex-math></inline-formula> DPs, and 3) using current mirrors in place of cascoded transistors in a conventional <inline-formula> <tex-math>$G_{m}$ </tex-math></inline-formula> stage. Measurement results of a <inline-formula> <tex-math>$0.18~mu $ </tex-math></inline-formula>m CMOS prototype show that the proposed filter, operating with a <inline-formula> <tex-math>$V_{text {DD}}$ </tex-math></inline-formula> of 0.5 V, achieves an third-order harmonic distortion (HD3) below −40 dB for input amplitudes up to 340 mV<sub>pp</sub>. With an integrated noise of <inline-formula> <tex-math>$154.7~mu $ </tex-math></inline-formula>V<sub>rms</sub> over a 240-Hz bandwidth, the filter exhibits a dynamic range (DR) of 53.6 dB, which is competitive with previously reported works.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2438-2448"},"PeriodicalIF":3.1,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}