{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information","authors":"","doi":"10.1109/TVLSI.2025.3587930","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3587930","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"C3-C3"},"PeriodicalIF":2.8,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11096974","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin Vicuña;Massimo Vatalaro;Frédéric Amiel;Felice Crupi;Lionel Trojman
{"title":"Highly Stable Reconfigurable TERO PUF Architecture for Hardware Security Applications","authors":"Kevin Vicuña;Massimo Vatalaro;Frédéric Amiel;Felice Crupi;Lionel Trojman","doi":"10.1109/TVLSI.2025.3587502","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3587502","url":null,"abstract":"This work introduces a novel 128-bit transient effect ring oscillator (TERO)-based physically unclonable function (PUF) designed for Intel MAX 10 field-programmable gate arrays (FPGAs). A reliable PUF solution suitable for security applications targeting high stability and area efficiency is presented. The proposed cell consists of two cross-coupled reconfigurable ring oscillators (ROs) aiming to achieve zero-observed instability at both golden key (GK) and under temperature variations. Conversely to the conventional application-specific integrated circuits (ASIC) approaches, which use the mean cycles to collapse (CTC), here the calibration process was performed by considering the CTC standard deviation extracted at GK conditions, namely, 1.2 V and 25 °C. The experimental results demonstrate that after the calibration process and considering a 1.64% of masked bits, the proposed solution shows a bit error rate (BER) lower than 1.56 × 10⁻⁴%, the minimum observable quantity for the adopted statistical set across the entire analyzed temperature range. Further, the solution also shows an excellent uniqueness of 49.78%, close to the ideal value of 50%. 
This is achieved at the cost of two logic array blocks (LABs) per bit.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 10","pages":"2873-2882"},"PeriodicalIF":3.1,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11095825","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jincheng Wang;Yuhao Shu;Lintao Lan;Yifei Li;Bin Ning;Yuxin Zhou;Hongtu Zhang;Yajun Ha
{"title":"A 5T0C eDRAM-Based Content Addressable Memory for High-Density Searching and Logic-in-Memory","authors":"Jincheng Wang;Yuhao Shu;Lintao Lan;Yifei Li;Bin Ning;Yuxin Zhou;Hongtu Zhang;Yajun Ha","doi":"10.1109/TVLSI.2025.3585747","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585747","url":null,"abstract":"With the development of big data, there is an increasing demand for high-density searching, where content-addressable memory (CAM) presents an attractive solution for its ability to perform parallel searches. However, this goal is constrained by the difficulty of further reducing the area of SRAM cells, which is commonly used in traditional CAM implementations. To address this issue, we propose a novel CAM with a compact five-transistor-zero-capacitor (5T0C)-embedded dynamic random access memory (eDRAM) for high-density searching and logic-in-memory applications. First, we propose the 5T0C eDRAM gain cell featuring a 3T0C write port and a decoupled read port of 2T to achieve data storage and searching operations. Second, we present a reconfigurable sense amplifier (RSA) design with two different reference voltages to optimize the area overhead of peripheral circuits and support logic operations. Moreover, the 5T0C eDRAM-based CAM can be employed to achieve high-density searching and logic operations. We have validated the eDRAM-based CAM array in the 40-nm CMOS process. The postlayout simulation results show that our design achieves over 15% higher memory density compared to the state-of-the-art 6T SRAM. 
Additionally, it supports a maximum frequency of 637 and 658 MHz for binary CAM (BCAM) searching and logic operations, while consuming 0.91 and 27.47 fJ/bit at 1.1 V, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2497-2507"},"PeriodicalIF":3.1,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kari Hepola;Tharaka Ranasinghe Arachchige;Joonas Multanen;Pekka Jääskeläinen
{"title":"Automatically Retargeting Hardware and Code Generation for RISC-V Custom Instructions","authors":"Kari Hepola;Tharaka Ranasinghe Arachchige;Joonas Multanen;Pekka Jääskeläinen","doi":"10.1109/TVLSI.2025.3586902","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3586902","url":null,"abstract":"Custom instruction (CI) set extensions are beneficial for increasing performance and energy efficiency in a set of target applications. For rapid prototyping of these types of application-specific processors, designers leverage hardware (HW)/software (SW) co-design to create hardware implementations and retarget the compiler using a high-level description of the instruction set extension. Ideally, the architecture description should be flexible enough to support both hardware generation and compiler retargeting from the same description format. The challenge with these methods lies in coupling hardware extensions with the processor core, because using microarchitecture-specific interfaces leads to low design reuse and increased verification effort. To mitigate these challenges, we introduce a HW/SW co-design toolset capable of adapting to a user-defined architecture description that captures the instruction set extension semantics. Based on the architecture description, the toolset can both retarget the compiler and generate co-processors interfacing with the Core-V eXtension interface (CV-X-IF) and Rocket custom co-processor interface (RoCC) protocols that are widely used standard interfaces for RISC-V processors. To demonstrate our methods, we integrate the co-processors with two different variations of CVA6 and Rocket core. The resulting execution time reduction is up to 40% on average, with an area overhead of 8% for the CVA6. 
For the Rocket core, the execution time reduction is 27% with a 6% area overhead.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 10","pages":"2852-2861"},"PeriodicalIF":3.1,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11082109","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BTI Aging Analysis and Mitigation for Differential Input In-Memory Computing SRAMs","authors":"Christina Dilopoulou;Yiorgos Tsiatouhas","doi":"10.1109/TVLSI.2025.3585027","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585027","url":null,"abstract":"SRAM-based in-memory computing (IMC) is a promising approach to overcome the bottleneck of traditional Von Neumann architectures that suffer from data transfer delay and energy inefficiency. Aging phenomena and process variations are a serious reliability and lifetime concern that may impact SRAM-based IMC array architectures, similar to conventional SRAM arrays. Bias temperature instability (BTI) is a dominant aging mechanism that degrades transistor performance and negatively affects the analog nature of the IMC computations. In this work, we present a simulation framework for the joined analysis of aging and process variation influence on IMC reliable operation. We perform, through SPICE simulations, an extensive BTI aging analysis on differential input SRAM-based IMC array architectures under different operating conditions and considering process variations. The simulation results show a substantial impact of aging on their reliability. Furthermore, we present an aging mitigation technique to maintain reliability and extend the lifetime of these circuits. Aging mitigation is achieved by periodically reconfiguring the active current paths in the IMC cells, with negligible cost on throughput and power consumption. The simulation results show that up to 68% of the IMC circuits can lose accuracy after three operating years, depending on the operating conditions. 
The aging mitigation technique effectively reduces the percentage of circuits that lose accuracy by up to 72% and decreases their degradation rate, essentially extending by more than 9.3× their reliable lifetime.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2570-2579"},"PeriodicalIF":3.1,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Complexity Implementation of Real-Time Reconfigurable Low-Pass Equalizers","authors":"Narges Mohammadi Sarband;Oksana Moryakova;Håkan Johansson;Oscar Gustafsson","doi":"10.1109/TVLSI.2025.3578450","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578450","url":null,"abstract":"Implementation techniques and results for a recently proposed real-time reconfigurable low-pass equalizer (RLPE) consisting of a variable bandwidth (VBW) filter and a variable equalizer (VE) are presented. Both components utilize fixed finite-length impulse response (FIR) filters combined with a few general multipliers, resulting in lower area and power consumption compared to a general FIR filter, despite requiring more multiplications. This is because the constant multipliers in the fixed FIR filters of the RLPE can be optimized for implementation. An additional advantage is that the proposed RLPE does not require online design. Various implementation alternatives for fixed FIR filters, including ways to increase the frequency, are evaluated to optimize the implementation of the RLPE. Several versions of the proposed RLPE and a general FIR filter for comparison are implemented using a 28-nm fully depleted silicon on insulator (FD-SOI) standard cell library. The results demonstrate that the RLPE baseline design requires less power and area than the general equalizer, and although the frequency of the baseline implementation is lower, the design can reach the same frequency while still having significantly less power and area. Furthermore, an approach is introduced to break the chain in the polynomial section of the VBW filter by using fewer additional registers compared to standard pipelining. Instead, this method reformulates the constant multiplication problem to produce correct results. 
For the considered case, the power consumption is reduced between 49% and 70% for different frequencies, with an area decrease in the range of 64%–67%, by using the proposed RLPE compared to a general FIR filter.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2462-2473"},"PeriodicalIF":3.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11074767","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 3-bit/Unit Time-Domain Compute-In-Memory Macro With Adjustable Unit Delay","authors":"Xie He;Dongxu Li","doi":"10.1109/TVLSI.2025.3585360","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585360","url":null,"abstract":"With the increasing demand for high-energy efficiency in multiply-accumulate (MAC) operations within deep learning accelerators, computing-in-memory (CIM) has gained significant attention. Time-domain (TD) CIM eliminates the need for analog-to-digital converters (ADCs), but single-bit delay units suffer from low computational efficiency. To address these issues, this work presents a TD multibit-per-unit CIM macro that leverages a precision-configurable time-to-digital converter (TDC) to enable accuracy configurability. Experimental results show that the proposed design achieves a 3-bit delay unit as a multibit CIM unit and an overall of 3-byte weight precision and 8-bit input precision. Compared to using three 1-bit/unit CIM delay units with an adder, it achieves a linearity with linear offset less than 3%. Besides, bias voltage adjusts the frequency and precision of the circuit (from 600 to 900 mV), enabling a minimum delay step of 0.11 ns. This system achieves a maximum energy efficiency of 268 TOPS/W under different VDD, making it a promising solution for always-on edge AI applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 10","pages":"2897-2901"},"PeriodicalIF":3.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yishuo Meng;Jianfei Wang;Qiang Fu;Jia Hou;Siwei Xiang;Ge Li;Chen Yang
{"title":"A High-Performance SCNN Accelerator Using Parallel Sparsity Detection and Index-Oriented Computation Workflow","authors":"Yishuo Meng;Jianfei Wang;Qiang Fu;Jia Hou;Siwei Xiang;Ge Li;Chen Yang","doi":"10.1109/TVLSI.2025.3584657","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3584657","url":null,"abstract":"The customization of accelerators for sparse convolutional neural networks (SCNNs) has been shown to significantly enhance the computational efficiency of CNNs. However, while processing the widely existing irregularly distributed sparsity in filters and feature maps, serial sparsity detection (SSD) methods and small-capacity computation arrays are always applied in current works. As a result, it is difficult to fully translate the exploitation of sparsity into hardware performance improvement. Therefore, in this article, first, a novel parallel sparsity detection (PSD) scheme is proposed and hardware-implemented to efficiently extract the valid weights and activations. In addition, an index-oriented computation workflow for parallel sparse convolution is also proposed to eliminate the output index diversity during sparse convolutions. With the assistance of the above sparsity detection scheme and computation workflow, a large-scale two-side SCNN accelerator is designed and implemented on the Xilinx VCU118 platform, achieving a runtime frequency of 300 MHz. The evaluation results indicate that this work can achieve 1284.43/1105.31 GOPS performance while deploying VGG16/ResNet-50. Compared to the previous dense-/sparse-based works, this work can achieve a performance enhancement ranging from 1.284× to 12.266× and a DSP efficiency improvement from 1.718× to 6.131×. 
These results highlight the superior ability to translate sparsity exploitation into performance gains.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2449-2461"},"PeriodicalIF":3.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeongmin Kim;Jaehoon Kwon;Hansol Jeong;In-Cheol Park
{"title":"Energy-Efficient Syndrome Calculation Architecture for BCH Decoders","authors":"Jeongmin Kim;Jaehoon Kwon;Hansol Jeong;In-Cheol Park","doi":"10.1109/TVLSI.2025.3585971","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585971","url":null,"abstract":"Syndrome calculation (SC) is a critical step in Bose-Chaudhuri-Hocquenghem (BCH) decoding, and its computational efficiency significantly impacts the energy consumption of the entire decoder. This article proposes an energy-efficient SC architecture designed for BCH decoders. The proposed architecture fundamentally adopts a remainder-based SC, which consumes less energy than the conventional Horner’s method-based SC unit. Furthermore, unlike previous remainder-based approaches, it uses a minimal polynomial to produce a shorter remainder, leading to reduced computation and improved energy efficiency. Implementation results demonstrate an 80% improvement in energy efficiency compared to the latest Horner’s method-based SC unit and a 35% improvement compared to the previous remainder-based SC unit.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2488-2496"},"PeriodicalIF":3.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"All-Digital CMOS Pulse-Shrinking Time-to-Digital Converter With Built-in Offset-Error Cancellation and Smart Temperature Sensor","authors":"Chun-Chi Chen;Chao-Lieh Chen;Kai-Hsiang Chang","doi":"10.1109/TVLSI.2025.3585732","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585732","url":null,"abstract":"This brief presents an all-digital CMOS time-to-digital converter (TDC) with an integrated smart temperature sensor (STS), effectively reducing circuit complexity and cost. Unlike previous designs employing a single coupling unit, the proposed TDC adopts a two-coupling-unit structure, simplifying the overall architecture while enabling pulse-shrinking time measurement and offset-error cancellation within a single cyclic delay line. The built-in cancellation enhances linearity while minimizing overhead. Notably, the integrated STS requires only one additional coupling unit, ensuring a negligible impact on circuit complexity and cost. Fabricated using the TSMC 0.35-µm CMOS process, the proposed design demonstrates improved cost efficiency compared to prior works. Experimental results validate the successful measurement of time and temperature, highlighting the advantages of reduced complexity and cost savings.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2597-2601"},"PeriodicalIF":3.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}