{"title":"A Universal Sequential Authentication Scheme for TAPC-Based Test Standards","authors":"Guan-Rong Chen;Kuen-Jong Lee","doi":"10.1109/TVLSI.2025.3562015","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3562015","url":null,"abstract":"Integrated circuits (ICs) have become extremely complex nowadays. Therefore, multiple test standards could be employed to handle different testing scenarios. Unfortunately, this also leads to serious security problems since attackers can exploit the excellent controllability and observability of test standards to steal confidential information or disrupt the circuit’s functionality. This article proposes a universal sequential authentication scheme that is compatible with test standards employing the test access port controller (TAPC) defined in IEEE Std 1149.1. The main objective is to protect multiple TAPC-based test standards with a universal security module. In this scheme, only authorized test data can be updated to the target register to control the corresponding test standard, and only the response to authorized test data can be output. The key idea is to generate different authentication keys for different test data, and even with the same set of test data, if their input sequences are different, their authentication keys will also be different. Furthermore, we develop an irreversible obfuscation mechanism to generate fake output data to confuse attackers. Due to its irreversibility, the original correct output data cannot be deduced from the fake output data. 
Experimental results on a typical processor, i.e., SCR1, show that the proposed scheme causes no time overhead, and the area overhead is only 1.74%.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1972-1982"},"PeriodicalIF":2.8,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel High-Throughput FFT Processor With a Block-Level Pipeline for 5G MIMO OFDM Systems","authors":"Meiyu Liu;Zhijun Wang;Hanqing Luo;Shengnan Lin;Liping Liang","doi":"10.1109/TVLSI.2025.3558947","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3558947","url":null,"abstract":"In fifth-generation (5G) communication systems, multiple input multiple output (MIMO) and orthogonal frequency-division multiplexing (OFDM) are two critical technologies. Fast Fourier transform (FFT), as the core processing steps of OFDM, directly affects the overall system performance. In this brief, we proposed a novel block-level pipelined architecture, which divides the FFT processor into three pipeline blocks: input, radix, and output. Each pipeline block can run in a different FFT simultaneously to achieve higher throughput. Specifically, to reduce the OFDM system-level latency of 5G applications, the FFT processor supports weighted overlap and add (WOLA) on the cyclic prefix and suffix of OFDM symbols. This architecture is implemented using TSMC 12-nm technology, with a processor die area of 0.89 mm<sup>2</sup> and a power consumption of 568 mW at 1 GHz. The FFT processor can achieve a system-level throughput up to 2.66 GS/s.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"2059-2063"},"PeriodicalIF":2.8,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 0.6-V 9.38-Bit 6.9-kS/s Capacitor-Splitting Bypass Window SAR ADC for Wearable 12-Lead ECG Acquisition Systems","authors":"Kangkang Sun;Jingjing Liu;Feng Yan;Yuan Ren;Ruihuang Wu;Bingjun Xiong;Zhipeng Li;Jian Guan","doi":"10.1109/TVLSI.2025.3559669","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3559669","url":null,"abstract":"This article proposes a fully differential ten-bit energy-efficient successive approximation register (SAR) analog-to-digital converter (ADC) for wearable 12-lead electrocardiogram (ECG) acquisition system. The proposed ADC structure generates two bypass windows through capacitor splitting technique, which can skip unnecessary quantization steps. The judgment module of bypass windows only requires an <sc>XOR</small> gate. By introducing redundant capacitors to participate in quantization, the total capacitance value is reduced by half. The proposed SAR ADC is fabricated using a standard 180-nm CMOS process. The measurement results show that it can achieve an effective number of bits (ENOBs) of 9.38 bits and a spurious-free dynamic range (SFDR) of 76.71 dB with a supply voltage of 0.6 V at a sampling rate (<inline-formula> <tex-math>$text{F}_{mathrm {S}}$ </tex-math></inline-formula>) of 6.94 kS/s. The power consumption is 15.61 nW when subjected to a 1.17-<inline-formula> <tex-math>$text{V}_{mathrm {PP}}~3.45$ </tex-math></inline-formula>-kHz sinusoidal input, resulting in a figure of merit (FoM) of 3.38 fJ/conv.-step. 
The average power consumption for quantizing 12-lead ECG signals is approximately 12.66 nW, demonstrating the ability to achieve ultralow-power quantization of ECG signals.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1838-1847"},"PeriodicalIF":2.8,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Density eDRAM Macro With Programmable Sense Amplifier and TG-Shifter for Logical-Instruction-Based In-Memory Computing","authors":"Kunyao Lai;Enyi Yao;Zhenxing Li;Yongkui Yang","doi":"10.1109/TVLSI.2025.3561507","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3561507","url":null,"abstract":"Embedded DRAM (eDRAM) has been widely adopted as on-chip cache memory in modern processors due to its high density. In this article, we propose a 2T gain-cell eDRAM-based macro that functions not only as traditional cache memory but also as an in-memory computing unit capable of performing logic operations. Furthermore, this eDRAM macro features in situ storing, completely eliminating the need for external memory or register access during computation. The sense amplifier in this macro is equipped with a programmable voltage reference, enabling support for various Boolean logic operations, including <sc>and</small>/<sc>nand</small>, <sc>or</small>/<sc>nor</small>, and <sc>not</small>. In addition, the macro integrates a transmission-gate (TG)-based shifter cluster to perform data shifting, which is commonly required in general computations. To enhance functionality, we design an instruction set that supports compound logic computations, allowing Boolean logic, shifting, and in situ storage to be executed within a single instruction. We validated this eDRAM macro in a 32-kb bitcell array using the 40-nm logic CMOS technology. 
Compared with state-of-the-art designs, our macro achieves a relatively high density of 729.2 kb/mm<sup>2</sup> and a competitive logic energy of 14.1 fJ/bit.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"2069-2073"},"PeriodicalIF":2.8,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CapsBeam: Accelerating Capsule Network-Based Beamformer for Ultrasound Nonsteered Plane-Wave Imaging on Field-Programmable Gate Array","authors":"Abdul Rahoof;Vivek Chaturvedi;Mahesh Raveendranatha Panicker;Muhammad Shafique","doi":"10.1109/TVLSI.2025.3559403","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3559403","url":null,"abstract":"In recent years, there has been a growing trend in accelerating computationally complex nonreal-time beamforming algorithms in ultrasound imaging using deep learning models. However, due to the large size and complexity, these state-of-the-art deep learning techniques pose significant challenges when deploying on resource-constrained edge devices. In this work, we propose a novel capsule network-based beamformer called CapsBeam, designed to operate on raw radio frequency data and provide an envelope of beamformed data through nonsteered plane-wave insonification. In experiments on in vivo data, CapsBeam reduced artifacts compared to the standard Delay-and-Sum (DAS) beamforming. For in vitro data, CapsBeam demonstrated a 32.31% increase in contrast, along with gains of 16.54% and 6.7% in axial and lateral resolution compared to the DAS. Similarly, in silico data showed a 26% enhancement in contrast, along with improvements of 13.6% and 21.5% in axial and lateral resolution, respectively, compared to the DAS. To reduce the parameter redundancy and enhance the computational efficiency, we pruned the model using our multilayer look-ahead kernel pruning (LAKP-ML) methodology, achieving a compression ratio of 85% without affecting the image quality. Additionally, the hardware complexity of the proposed model is reduced by applying quantization, simplification of nonlinear operations, and parallelizing operations. Finally, we proposed a specialized accelerator architecture for the pruned and optimized CapsBeam model, implemented on a Xilinx ZU7EV FPGA. 
The proposed accelerator achieved a throughput of 30 GOPS for the convolution operation and 17.4 GOPS for the dynamic routing operation.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1934-1944"},"PeriodicalIF":2.8,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information","authors":"","doi":"10.1109/TVLSI.2025.3557605","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3557605","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"C3-C3"},"PeriodicalIF":2.8,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10977653","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information","authors":"","doi":"10.1109/TVLSI.2025.3557603","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3557603","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"C2-C2"},"PeriodicalIF":2.8,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10977654","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Upscale Layer Acceleration on Existing AI Hardware","authors":"Vuk Vranjkovic;Predrag Teodorovic;Rastislav Struharik","doi":"10.1109/TVLSI.2025.3558946","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3558946","url":null,"abstract":"Upscaling layers are important components of modern deep learning networks but often pose computational challenges for hardware (HW) accelerators. This article addresses this issue by introducing a novel layer-replacement technique to efficiently process upscaling layers using existing hardware-supported operations like depthwise convolutions and maximum pooling. To minimize the number of replacement layers, we propose an efficient layer number reduction algorithm. Experimental results on four deep neural networks demonstrate a significant speedup ranging from <inline-formula> <tex-math>$1.58times $ </tex-math></inline-formula> to <inline-formula> <tex-math>$32.88times $ </tex-math></inline-formula> compared to the original HW/software (SW) execution approach, and from <inline-formula> <tex-math>$3.65times $ </tex-math></inline-formula> to <inline-formula> <tex-math>$19.21times $ </tex-math></inline-formula> compared to the software-only solution, with minimal hardware overhead (0.068% more field-programmable gate array (FPGA) look-up tables (LUTs)). 
Notably, our technique introduces no numerical errors and maintains comparable input data processing quality to the original network.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 6","pages":"1624-1637"},"PeriodicalIF":2.8,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144117387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DMSA: An Efficient Architecture for Sparse–Sparse Matrix Multiplication Based on Distribute-Merge Product Dataflow","authors":"Yuta Nagahara;Jiale Yan;Kazushi Kawamura;Daichi Fujiki;Masato Motomura;Thiem Van Chu","doi":"10.1109/TVLSI.2025.3558895","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3558895","url":null,"abstract":"The sparse–sparse matrix multiplication (SpMSpM) is a fundamental operation in various applications. Existing SpMSpM accelerators based on inner product (IP) and outer product (OP) suffer from low computational efficiency and high memory traffic due to inefficient index matching and merging overheads. Gustavson’s product (GP)-based accelerators mitigate some of these challenges but struggle with workload imbalance and irregular memory access patterns, limiting computational parallelism. To overcome these limitations, we propose a distribute-merge product (DMP), a novel SpMSpM dataflow that evenly distributes workloads across multiple computation streams and merges partial results efficiently. We design and implement DMP-based SpMSpM architecture (DMSA), incorporating four key techniques to fully exploit the parallelism of DMP and efficiently handle irregular memory accesses. Implemented on a Xilinx ZCU106 FPGA, DMSA achieves speedups of up to <inline-formula> <tex-math>$3.38times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.73times $ </tex-math></inline-formula> over two state-of-the-art FPGA-based SpMSpM accelerators while maintaining comparable hardware resource usage. 
In addition, compared to CPU and GPU implementations on an NVIDIA Jetson AGX Xavier, DMSA is <inline-formula> <tex-math>$4.96times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.53times $ </tex-math></inline-formula> faster while achieving <inline-formula> <tex-math>$6.67times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$2.33times $ </tex-math></inline-formula> better energy efficiency, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1858-1871"},"PeriodicalIF":2.8,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An 197-μJ/Frame Single-Frame Bundle Adjustment Hardware Accelerator for Mobile Visual Odometry","authors":"Cheng Nian;Xiaorui Mo;Weiyi Zhang;Fasih Ud Din Farrukh;Yushi Guo;Fei Chen;Chun Zhang","doi":"10.1109/TVLSI.2025.3557872","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3557872","url":null,"abstract":"This article presents an energy-efficient hardware accelerator for optimized bundle adjustment (BA) for mobile high-frame-rate visual odometry (VO). BA uses graph optimization techniques to optimize poses and landmarks and the applications are robot navigation, virtual reality (VR), and augmented reality (AR). Existing software implementations of BA optimization involve complex computational flows, numerical calculations, Lie group, and Lie algebra conversions. This poses challenges of slow computational speeds and high power consumption. A two-level reuse hardware architecture is proposed and implemented that efficiently updates the Jacobian matrix while reducing the field-programmable gate array (FPGA) hardware resources by 25%. A set of methodologies is proposed to quantify the errors caused by fixed-point systems during optimization. A fully pipelined architecture is implemented to increase computational speed while reducing hardware resources by 29%. This design features a parallel equation solver that improves processing speed by <inline-formula> <tex-math>$2times $ </tex-math></inline-formula> compared to conventional approaches. This article employs a single-frame local BA VO on the KITTI dataset and EuRoC dataset, achieving an average translational error of 0.75% and a rotational error of <inline-formula> <tex-math>$0.0028~^{circ } $ </tex-math></inline-formula>/m. The proposed hardware achieves a performance ranging from 188 to 345 frames/s in optimizing two main feature extraction methods with a maximum of 512 extracted feature points. 
Compared to state-of-the-art implementations, the accelerator achieved a minimum energy efficiency ratio of 11.6 mJ and <inline-formula> <tex-math>$191~mu $ </tex-math></inline-formula>J on the FPGA platform and application-specific integrated circuits (ASICs) platform, respectively. These improvements underscore the potential of FPGAs to enhance VO systems’ adaptability and efficiency in complex environments.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1872-1885"},"PeriodicalIF":2.8,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}