{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information","authors":"","doi":"10.1109/TVLSI.2025.3557605","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3557605","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"C3-C3"},"PeriodicalIF":2.8,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10977653","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abdul Rahoof;Vivek Chaturvedi;Mahesh Raveendranatha Panicker;Muhammad Shafique
{"title":"CapsBeam: Accelerating Capsule Network-Based Beamformer for Ultrasound Nonsteered Plane-Wave Imaging on Field-Programmable Gate Array","authors":"Abdul Rahoof;Vivek Chaturvedi;Mahesh Raveendranatha Panicker;Muhammad Shafique","doi":"10.1109/TVLSI.2025.3559403","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3559403","url":null,"abstract":"In recent years, there has been a growing trend in accelerating computationally complex nonreal-time beamforming algorithms in ultrasound imaging using deep learning models. However, due to the large size and complexity, these state-of-the-art deep learning techniques pose significant challenges when deploying on resource-constrained edge devices. In this work, we propose a novel capsule network-based beamformer called CapsBeam, designed to operate on raw radio frequency data and provide an envelope of beamformed data through nonsteered plane-wave insonification. In experiments on in vivo data, CapsBeam reduced artifacts compared to the standard Delay-and-Sum (DAS) beamforming. For in vitro data, CapsBeam demonstrated a 32.31% increase in contrast, along with gains of 16.54% and 6.7% in axial and lateral resolution compared to the DAS. Similarly, in silico data showed a 26% enhancement in contrast, along with improvements of 13.6% and 21.5% in axial and lateral resolution, respectively, compared to the DAS. To reduce the parameter redundancy and enhance the computational efficiency, we pruned the model using our multilayer look-ahead kernel pruning (LAKP-ML) methodology, achieving a compression ratio of 85% without affecting the image quality. Additionally, the hardware complexity of the proposed model is reduced by applying quantization, simplification of nonlinear operations, and parallelizing operations. Finally, we proposed a specialized accelerator architecture for the pruned and optimized CapsBeam model, implemented on a Xilinx ZU7EV FPGA. The proposed accelerator achieved a throughput of 30 GOPS for the convolution operation and 17.4 GOPS for the dynamic routing operation.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1934-1944"},"PeriodicalIF":2.8,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information","authors":"","doi":"10.1109/TVLSI.2025.3557603","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3557603","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"C2-C2"},"PeriodicalIF":2.8,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10977654","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Upscale Layer Acceleration on Existing AI Hardware","authors":"Vuk Vranjkovic;Predrag Teodorovic;Rastislav Struharik","doi":"10.1109/TVLSI.2025.3558946","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3558946","url":null,"abstract":"Upscaling layers are important components of modern deep learning networks but often pose computational challenges for hardware (HW) accelerators. This article addresses this issue by introducing a novel layer-replacement technique to efficiently process upscaling layers using existing hardware-supported operations like depthwise convolutions and maximum pooling. To minimize the number of replacement layers, we propose an efficient layer number reduction algorithm. Experimental results on four deep neural networks demonstrate a significant speedup ranging from <inline-formula> <tex-math>$1.58times $ </tex-math></inline-formula> to <inline-formula> <tex-math>$32.88times $ </tex-math></inline-formula> compared to the original HW/software (SW) execution approach, and from <inline-formula> <tex-math>$3.65times $ </tex-math></inline-formula> to <inline-formula> <tex-math>$19.21times $ </tex-math></inline-formula> compared to the software-only solution, with minimal hardware overhead (0.068% more field-programmable gate array (FPGA) look-up tables (LUTs)). Notably, our technique introduces no numerical errors and maintains comparable input data processing quality to the original network.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 6","pages":"1624-1637"},"PeriodicalIF":2.8,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144117387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuta Nagahara;Jiale Yan;Kazushi Kawamura;Daichi Fujiki;Masato Motomura;Thiem Van Chu
{"title":"DMSA: An Efficient Architecture for Sparse–Sparse Matrix Multiplication Based on Distribute-Merge Product Dataflow","authors":"Yuta Nagahara;Jiale Yan;Kazushi Kawamura;Daichi Fujiki;Masato Motomura;Thiem Van Chu","doi":"10.1109/TVLSI.2025.3558895","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3558895","url":null,"abstract":"The sparse–sparse matrix multiplication (SpMSpM) is a fundamental operation in various applications. Existing SpMSpM accelerators based on inner product (IP) and outer product (OP) suffer from low computational efficiency and high memory traffic due to inefficient index matching and merging overheads. Gustavson’s product (GP)-based accelerators mitigate some of these challenges but struggle with workload imbalance and irregular memory access patterns, limiting computational parallelism. To overcome these limitations, we propose a distribute-merge product (DMP), a novel SpMSpM dataflow that evenly distributes workloads across multiple computation streams and merges partial results efficiently. We design and implement DMP-based SpMSpM architecture (DMSA), incorporating four key techniques to fully exploit the parallelism of DMP and efficiently handle irregular memory accesses. Implemented on a Xilinx ZCU106 FPGA, DMSA achieves speedups of up to <inline-formula> <tex-math>$3.38times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.73times $ </tex-math></inline-formula> over two state-of-the-art FPGA-based SpMSpM accelerators while maintaining comparable hardware resource usage. In addition, compared to CPU and GPU implementations on an NVIDIA Jetson AGX Xavier, DMSA is <inline-formula> <tex-math>$4.96times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.53times $ </tex-math></inline-formula> faster while achieving <inline-formula> <tex-math>$6.67times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$2.33times $ </tex-math></inline-formula> better energy efficiency, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1858-1871"},"PeriodicalIF":2.8,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cheng Nian;Xiaorui Mo;Weiyi Zhang;Fasih Ud Din Farrukh;Yushi Guo;Fei Chen;Chun Zhang
{"title":"An 197-μJ/Frame Single-Frame Bundle Adjustment Hardware Accelerator for Mobile Visual Odometry","authors":"Cheng Nian;Xiaorui Mo;Weiyi Zhang;Fasih Ud Din Farrukh;Yushi Guo;Fei Chen;Chun Zhang","doi":"10.1109/TVLSI.2025.3557872","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3557872","url":null,"abstract":"This article presents an energy-efficient hardware accelerator for optimized bundle adjustment (BA) for mobile high-frame-rate visual odometry (VO). BA uses graph optimization techniques to optimize poses and landmarks and the applications are robot navigation, virtual reality (VR), and augmented reality (AR). Existing software implementations of BA optimization involve complex computational flows, numerical calculations, Lie group, and Lie algebra conversions. This poses challenges of slow computational speeds and high power consumption. A two-level reuse hardware architecture is proposed and implemented that efficiently updates the Jacobian matrix while reducing the field-programmable gate array (FPGA) hardware resources by 25%. A set of methodologies is proposed to quantify the errors caused by fixed-point systems during optimization. A fully pipelined architecture is implemented to increase computational speed while reducing hardware resources by 29%. This design features a parallel equation solver that improves processing speed by <inline-formula> <tex-math>$2times $ </tex-math></inline-formula> compared to conventional approaches. This article employs a single-frame local BA VO on the KITTI dataset and EuRoC dataset, achieving an average translational error of 0.75% and a rotational error of <inline-formula> <tex-math>$0.0028~^{circ } $ </tex-math></inline-formula>/m. The proposed hardware achieves a performance ranging from 188 to 345 frames/s in optimizing two main feature extraction methods with a maximum of 512 extracted feature points. Compared to state-of-the-art implementations, the accelerator achieved a minimum energy efficiency ratio of 11.6 mJ and <inline-formula> <tex-math>$191~mu $ </tex-math></inline-formula>J on the FPGA platform and application-specific integrated circuits (ASICs) platform, respectively. These improvements underscore the potential of FPGAs to enhance VO systems’ adaptability and efficiency in complex environments.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1872-1885"},"PeriodicalIF":2.8,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Switched-Based Slew Rate and Gain Boosting Parallel-Path Amplifier for Switched-Capacitor Applications","authors":"Javad Bagheri Asli;Alireza Saberkari;Atila Alvandpour","doi":"10.1109/TVLSI.2025.3557467","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3557467","url":null,"abstract":"A parallel-path amplifier (PPA) incorporating a switched-based slew rate and gain boosting stage as a feed-forward path, in parallel with a linear amplifier is introduced in this brief as an alternative to conventional analog amplifiers to achieve a high accuracy through the linear path and high slewing through the assisted feed-forward path. The feed-forward path employs a pre-amplifier, hysteresis-detector, and differential charge pumps, while the linear path includes a recycling folded-cascode amplifier. An analysis is performed to study the amplifier’s settling error with and without the feed-forward path, and also the trade-off between the dead-zone width of the hysteresis detector and the amplifier’s settling speed. The assisted feed-forward path has improved the slew rate <inline-formula> <tex-math>$times 2.5$ </tex-math></inline-formula>–800 V/<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>s, effective GBW by 15%, and dc gain by 16 dB at the expense of adding <inline-formula> <tex-math>$187.5~mu $ </tex-math></inline-formula>A extra current consumption and <inline-formula> <tex-math>$1.25~mu $ </tex-math></inline-formula>m<sup>2</sup> extra silicon area. To prove the concept, the proposed amplifier is used as a multiplying digital-to-analog converter (MDAC) amplifier of an 8-bit pipeline analog-to-digital converter (ADC), and the ADC is fabricated in a 65-nm CMOS process. The results reveal that the spurious free dynamic range (SFDR) and signal-to-noise and distortion ratio (SNDR) performances are improved by 6–7 dB in the presence of the feed-forward path.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 6","pages":"1799-1802"},"PeriodicalIF":2.8,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144117249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Wireless PHY With Adaptive OFDM and Multiarmed Bandit Learning on Zynq System-on-Chip","authors":"Neelam Singh;Sumit J. Darak","doi":"10.1109/TVLSI.2025.3528865","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3528865","url":null,"abstract":"In this work, we present an intelligent and reconfigurable wireless physical layer (PHY) that dynamically adjusts the transmission parameters for a given radio frequency (RF) environment. The proposed PHY is based on orthogonal frequency division multiplexing (OFDM) and can dynamically augment OFDM with a finite impulse response (FIR) low-pass filter to improve the out-of-band emissions (OOBE). To make these adaptations intelligently, we employ multiarmed bandit (MAB)-based online learning algorithms, specifically upper confidence bound with control variate (UCB-CV). UCB-CV enhances traditional UCB by incorporating additional information such as interference level and transmit power, allowing it to manage interference more effectively. These algorithms are integrated into the PHY of an FPGA-based OFDM transceiver on the Zynq system-on-chip (SoC), facilitating real-time decision-making based on side-channel interference and other parameters. Our comparative analysis highlights the enhanced performance of the UCB-CV algorithm over the traditional UCB in terms of reducing the bit-error rate (BER) and managing interference more effectively. Unlike the traditional UCB, UCB-CV leverages side information through a control variate approach, incorporating the coefficient of variation (CV) into reward estimation to better handle interference. Additionally, we underline the advantages of filtered-OFDM (FOFDM) compared to standard OFDM. Notably, FOFDM significantly reduces OOBE by 20–75 dBW/Hz and improves BER. In environments with high interference, UCB-CV achieves a throughput improvement of 29.54% compared to UCB.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 6","pages":"1651-1664"},"PeriodicalIF":2.8,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144117345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing Homomorphic Encryption-Based Logic Locking in System-On-Chip Designs","authors":"Ziyang Ye;Makoto Ikeda","doi":"10.1109/TVLSI.2025.3556241","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3556241","url":null,"abstract":"This study presents a logic-locking scheme based on the binary ring learning with error (bin-RLWE) algorithm, implemented in a reduced instruction set computer-five (RISC-V) system-on-chip (SoC) design. Unlike traditional logic-locking methods that require providing users with raw locking parameters, the proposed approach secures critical logic paths in the privilege switching process without exposing these sensitive parameters. The implemented locking module itself consumes 3519 lookup tables (LUTs) and 2645 registers, leading to an overall overhead of 6.0% in LUTs and 6.9% in registers compared to the baseline system. The unlock process requires about <inline-formula> <tex-math>$2.6~mu $ </tex-math></inline-formula>s, introducing moderate performance impact and primarily affecting system-level operations while preserving user-level computational efficiency.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"2049-2053"},"PeriodicalIF":2.8,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Driver Monitoring: Implementing FPGA-Accelerated CNNs for Pose Detection","authors":"Minjoon Kim;Jaehyuk So","doi":"10.1109/TVLSI.2025.3554880","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3554880","url":null,"abstract":"As autonomous driving technology advances at an unprecedented pace, drivers are experiencing greater freedom within their vehicles, which accelerates the development of various intelligent systems to support safe and more efficient driving. These intelligent systems provide interactive applications between the vehicle and the driver, utilizing driver behavior analysis (DBA). A key performance indicator is real-time driver monitoring quality, as it directly impacts both safety and convenience in vehicle operation. In order to achieve real-time interaction, an image processing speed exceeding 30 frames/s and a delay time (latency) below 100 ms are generally required. However, expensive devices are often necessary to support this with software. Therefore, this article presents an algorithm and implementation results for immediate in-vehicle DBA through field-programmable gate array (FPGA)-based high-speed upper body-pose estimation. First, we define the 11 key points related to the driver’s pose and gaze and model a convolutional neural network (CNN) architecture that can quickly detect them. The proposed algorithm utilizes regeneration and retraining through layer reduction based on the residual-CNN model. In addition, the algorithm presents the results of its implementation at the register transfer level (RTL) level of the VCU118 FPGA and demonstrates simulation results of 34.7 frames/s and a delay time of 75.3 ms. Lastly, we discuss the results of linking a demo application and creating a vehicle testbed to experiment with the driver–vehicle interaction (DVI) system. A developed FPGA platform is implemented to process camera image input in real time. It reliably supports detected pose and gaze results at 30 frames/s via Ethernet. It also presents results that verify its application in screen control and driver monitoring systems.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1848-1857"},"PeriodicalIF":2.8,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}