IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles, Page 2

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-25 DOI: 10.1109/TVLSI.2025.3557605

Citations: 0

CapsBeam: Accelerating Capsule Network-Based Beamformer for Ultrasound Nonsteered Plane-Wave Imaging on Field-Programmable Gate Array

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-25 DOI: 10.1109/TVLSI.2025.3559403

Abdul Rahoof;Vivek Chaturvedi;Mahesh Raveendranatha Panicker;Muhammad Shafique

Abstract: In recent years, there has been a growing trend in accelerating computationally complex nonreal-time beamforming algorithms in ultrasound imaging using deep learning models. However, due to the large size and complexity, these state-of-the-art deep learning techniques pose significant challenges when deploying on resource-constrained edge devices. In this work, we propose a novel capsule network-based beamformer called CapsBeam, designed to operate on raw radio frequency data and provide an envelope of beamformed data through nonsteered plane-wave insonification. In experiments on in vivo data, CapsBeam reduced artifacts compared to the standard Delay-and-Sum (DAS) beamforming. For in vitro data, CapsBeam demonstrated a 32.31% increase in contrast, along with gains of 16.54% and 6.7% in axial and lateral resolution compared to the DAS. Similarly, in silico data showed a 26% enhancement in contrast, along with improvements of 13.6% and 21.5% in axial and lateral resolution, respectively, compared to the DAS. To reduce the parameter redundancy and enhance the computational efficiency, we pruned the model using our multilayer look-ahead kernel pruning (LAKP-ML) methodology, achieving a compression ratio of 85% without affecting the image quality. Additionally, the hardware complexity of the proposed model is reduced by applying quantization, simplification of nonlinear operations, and parallelizing operations. Finally, we proposed a specialized accelerator architecture for the pruned and optimized CapsBeam model, implemented on a Xilinx ZU7EV FPGA. The proposed accelerator achieved a throughput of 30 GOPS for the convolution operation and 17.4 GOPS for the dynamic routing operation.

Vol. 33, No. 7, pp. 1934-1944. Journal Article.

Citations: 0
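For readers unfamiliar with the Delay-and-Sum (DAS) baseline named in the CapsBeam abstract, the following is a minimal sketch of textbook DAS for a single nonsteered (0°) plane-wave transmission. It is not code from the paper: the function name `das_plane_wave`, the argument layout, and the nearest-sample rounding are illustrative assumptions, and no apodization or envelope detection is applied.

```python
import math


def das_plane_wave(rf, element_x, fs, c, pixels):
    """Textbook delay-and-sum for one 0-degree plane-wave transmit (illustrative).

    rf[e][n]   : RF sample n received on element e
    element_x  : lateral position of each element, in metres
    fs         : sampling frequency in Hz
    c          : assumed speed of sound in m/s
    pixels     : list of (x, z) image points, in metres
    """
    out = []
    for x, z in pixels:
        total = 0.0
        for e, xe in enumerate(element_x):
            # Transmit path: the plane wave reaches depth z after z / c.
            # Receive path: the echo travels from (x, z) back to element e.
            tau = (z + math.hypot(x - xe, z)) / c
            n = int(round(tau * fs))  # nearest-sample delay, no interpolation
            if 0 <= n < len(rf[e]):
                total += rf[e][n]
        out.append(total)
    return out
```

Echoes from a scatterer add coherently only at the pixel whose delays match the scatterer position, which is the property learned beamformers such as the one described above are compared against.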

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-25 DOI: 10.1109/TVLSI.2025.3557603

Citations: 0

Upscale Layer Acceleration on Existing AI Hardware

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-23 DOI: 10.1109/TVLSI.2025.3558946

Vuk Vranjkovic;Predrag Teodorovic;Rastislav Struharik

Abstract: Upscaling layers are important components of modern deep learning networks but often pose computational challenges for hardware (HW) accelerators. This article addresses this issue by introducing a novel layer-replacement technique to efficiently process upscaling layers using existing hardware-supported operations like depthwise convolutions and maximum pooling. To minimize the number of replacement layers, we propose an efficient layer number reduction algorithm. Experimental results on four deep neural networks demonstrate a significant speedup ranging from $1.58\times$ to $32.88\times$ compared to the original HW/software (SW) execution approach, and from $3.65\times$ to $19.21\times$ compared to the software-only solution, with minimal hardware overhead (0.068% more field-programmable gate array (FPGA) look-up tables (LUTs)). Notably, our technique introduces no numerical errors and maintains comparable input data processing quality to the original network.

Vol. 33, No. 6, pp. 1624-1637. Journal Article.

Citations: 0

DMSA: An Efficient Architecture for Sparse–Sparse Matrix Multiplication Based on Distribute-Merge Product Dataflow

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-23 DOI: 10.1109/TVLSI.2025.3558895

Yuta Nagahara;Jiale Yan;Kazushi Kawamura;Daichi Fujiki;Masato Motomura;Thiem Van Chu

Abstract: The sparse–sparse matrix multiplication (SpMSpM) is a fundamental operation in various applications. Existing SpMSpM accelerators based on inner product (IP) and outer product (OP) suffer from low computational efficiency and high memory traffic due to inefficient index matching and merging overheads. Gustavson’s product (GP)-based accelerators mitigate some of these challenges but struggle with workload imbalance and irregular memory access patterns, limiting computational parallelism. To overcome these limitations, we propose a distribute-merge product (DMP), a novel SpMSpM dataflow that evenly distributes workloads across multiple computation streams and merges partial results efficiently. We design and implement DMP-based SpMSpM architecture (DMSA), incorporating four key techniques to fully exploit the parallelism of DMP and efficiently handle irregular memory accesses. Implemented on a Xilinx ZCU106 FPGA, DMSA achieves speedups of up to $3.38\times$ and $1.73\times$ over two state-of-the-art FPGA-based SpMSpM accelerators while maintaining comparable hardware resource usage. In addition, compared to CPU and GPU implementations on an NVIDIA Jetson AGX Xavier, DMSA is $4.96\times$ and $1.53\times$ faster while achieving $6.67\times$ and $2.33\times$ better energy efficiency, respectively.

Vol. 33, No. 7, pp. 1858-1871. Journal Article.

Citations: 0
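As background for the dataflows contrasted in the DMSA abstract, here is a minimal software sketch of Gustavson's row-wise product, the GP baseline it mentions. This is the standard textbook formulation and not the paper's distribute-merge product (DMP); the dictionary-of-rows storage and the name `spmspm_gustavson` are assumptions made for illustration.

```python
from collections import defaultdict


def spmspm_gustavson(a_rows, b_rows):
    """Row-wise (Gustavson) sparse-sparse matrix product C = A @ B.

    Both inputs are stored as {row_index: {col_index: value}} (an assumed,
    illustrative layout). Only nonzero entries are stored or produced.
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = defaultdict(float)  # accumulator for output row i
        for k, a_ik in a_row.items():
            # Each nonzero A[i, k] scales row k of B and merges it into row i of C.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj
        row = {j: v for j, v in acc.items() if v != 0}  # drop cancelled entries
        if row:
            c_rows[i] = row
    return c_rows
```

The amount of work per output row depends on how many nonzeros each row of A touches in B, which illustrates the workload imbalance and irregular access that the abstract attributes to GP-based designs.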

An 197-μJ/Frame Single-Frame Bundle Adjustment Hardware Accelerator for Mobile Visual Odometry

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-22 DOI: 10.1109/TVLSI.2025.3557872

Cheng Nian;Xiaorui Mo;Weiyi Zhang;Fasih Ud Din Farrukh;Yushi Guo;Fei Chen;Chun Zhang

Abstract: This article presents an energy-efficient hardware accelerator for optimized bundle adjustment (BA) for mobile high-frame-rate visual odometry (VO). BA uses graph optimization techniques to optimize poses and landmarks and the applications are robot navigation, virtual reality (VR), and augmented reality (AR). Existing software implementations of BA optimization involve complex computational flows, numerical calculations, Lie group, and Lie algebra conversions. This poses challenges of slow computational speeds and high power consumption. A two-level reuse hardware architecture is proposed and implemented that efficiently updates the Jacobian matrix while reducing the field-programmable gate array (FPGA) hardware resources by 25%. A set of methodologies is proposed to quantify the errors caused by fixed-point systems during optimization. A fully pipelined architecture is implemented to increase computational speed while reducing hardware resources by 29%. This design features a parallel equation solver that improves processing speed by $2\times$ compared to conventional approaches. This article employs a single-frame local BA VO on the KITTI dataset and EuRoC dataset, achieving an average translational error of 0.75% and a rotational error of $0.0028~^{\circ}$/m. The proposed hardware achieves a performance ranging from 188 to 345 frames/s in optimizing two main feature extraction methods with a maximum of 512 extracted feature points. Compared to state-of-the-art implementations, the accelerator achieved a minimum energy efficiency ratio of 11.6 mJ and $191~\mu$J on the FPGA platform and application-specific integrated circuits (ASICs) platform, respectively. These improvements underscore the potential of FPGAs to enhance VO systems’ adaptability and efficiency in complex environments.

Vol. 33, No. 7, pp. 1872-1885. Journal Article.

Citations: 0

A Switched-Based Slew Rate and Gain Boosting Parallel-Path Amplifier for Switched-Capacitor Applications

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-22 DOI: 10.1109/TVLSI.2025.3557467

Javad Bagheri Asli;Alireza Saberkari;Atila Alvandpour

Abstract: A parallel-path amplifier (PPA) incorporating a switched-based slew rate and gain boosting stage as a feed-forward path, in parallel with a linear amplifier is introduced in this brief as an alternative to conventional analog amplifiers to achieve a high accuracy through the linear path and high slewing through the assisted feed-forward path. The feed-forward path employs a pre-amplifier, hysteresis-detector, and differential charge pumps, while the linear path includes a recycling folded-cascode amplifier. An analysis is performed to study the amplifier’s settling error with and without the feed-forward path, and also the trade-off between the dead-zone width of the hysteresis detector and the amplifier’s settling speed. The assisted feed-forward path has improved the slew rate $\times 2.5$–800 V/$\mu$s, effective GBW by 15%, and dc gain by 16 dB at the expense of adding $187.5~\mu$A extra current consumption and $1.25~\mu$m$^2$ extra silicon area. To prove the concept, the proposed amplifier is used as a multiplying digital-to-analog converter (MDAC) amplifier of an 8-bit pipeline analog-to-digital converter (ADC), and the ADC is fabricated in a 65-nm CMOS process. The results reveal that the spurious free dynamic range (SFDR) and signal-to-noise and distortion ratio (SNDR) performances are improved by 6–7 dB in the presence of the feed-forward path.

Vol. 33, No. 6, pp. 1799-1802. Journal Article.

Citations: 0

Enhancing Wireless PHY With Adaptive OFDM and Multiarmed Bandit Learning on Zynq System-on-Chip

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-14 DOI: 10.1109/TVLSI.2025.3528865

Neelam Singh;Sumit J. Darak

Abstract: In this work, we present an intelligent and reconfigurable wireless physical layer (PHY) that dynamically adjusts the transmission parameters for a given radio frequency (RF) environment. The proposed PHY is based on orthogonal frequency division multiplexing (OFDM) and can dynamically augment OFDM with a finite impulse response (FIR) low-pass filter to improve the out-of-band emissions (OOBE). To make these adaptations intelligently, we employ multiarmed bandit (MAB)-based online learning algorithms, specifically upper confidence bound with control variate (UCB-CV). UCB-CV enhances traditional UCB by incorporating additional information such as interference level and transmit power, allowing it to manage interference more effectively. These algorithms are integrated into the PHY of an FPGA-based OFDM transceiver on the Zynq system-on-chip (SoC), facilitating real-time decision-making based on side-channel interference and other parameters. Our comparative analysis highlights the enhanced performance of the UCB-CV algorithm over the traditional UCB in terms of reducing the bit-error rate (BER) and managing interference more effectively. Unlike the traditional UCB, UCB-CV leverages side information through a control variate approach, incorporating the coefficient of variation (CV) into reward estimation to better handle interference. Additionally, we underline the advantages of filtered-OFDM (FOFDM) compared to standard OFDM. Notably, FOFDM significantly reduces OOBE by 20–75 dBW/Hz and improves BER. In environments with high interference, UCB-CV achieves a throughput improvement of 29.54% compared to UCB.

Vol. 33, No. 6, pp. 1651-1664. Journal Article.

Citations: 0
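To make the "traditional UCB" baseline in the abstract above concrete, the sketch below implements the standard UCB1 index policy. It deliberately omits the control-variate extension (UCB-CV) described in the paper; the function name `ucb1`, the `pull` callback, and the exploration constant of 2 are illustrative assumptions rather than details taken from the article.

```python
import math


def ucb1(pull, n_arms, n_rounds):
    """Standard UCB1 bandit policy (illustrative; not the paper's UCB-CV).

    pull(arm) must return a numeric reward, ideally in [0, 1].
    Returns (counts, means): pulls per arm and the empirical mean reward per arm.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise its estimate
        else:
            # Pick the arm with the largest mean plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means
```

In a PHY setting, each arm could stand for one candidate transmission configuration and the reward for a normalised throughput measurement; that mapping is a hypothetical illustration, not the configuration set used by the authors.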

Implementing Homomorphic Encryption-Based Logic Locking in System-On-Chip Designs

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-10 DOI: 10.1109/TVLSI.2025.3556241

Ziyang Ye;Makoto Ikeda

Citations: 0

Real-Time Driver Monitoring: Implementing FPGA-Accelerated CNNs for Pose Detection

IF 2.8, Zone 2 (Engineering Technology)

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-10 DOI: 10.1109/TVLSI.2025.3554880

Minjoon Kim;Jaehyuk So

Abstract: As autonomous driving technology advances at an unprecedented pace, drivers are experiencing greater freedom within their vehicles, which accelerates the development of various intelligent systems to support safe and more efficient driving. These intelligent systems provide interactive applications between the vehicle and the driver, utilizing driver behavior analysis (DBA). A key performance indicator is real-time driver monitoring quality, as it directly impacts both safety and convenience in vehicle operation. In order to achieve real-time interaction, an image processing speed exceeding 30 frames/s and a delay time (latency) below 100 ms are generally required. However, expensive devices are often necessary to support this with software. Therefore, this article presents an algorithm and implementation results for immediate in-vehicle DBA through field-programmable gate array (FPGA)-based high-speed upper body-pose estimation. First, we define the 11 key points related to the driver’s pose and gaze and model a convolutional neural network (CNN) architecture that can quickly detect them. The proposed algorithm utilizes regeneration and retraining through layer reduction based on the residual-CNN model. In addition, the algorithm presents the results of its implementation at the register transfer level (RTL) level of the VCU118 FPGA and demonstrates simulation results of 34.7 frames/s and a delay time of 75.3 ms. Lastly, we discuss the results of linking a demo application and creating a vehicle testbed to experiment with the driver–vehicle interaction (DVI) system. A developed FPGA platform is implemented to process camera image input in real time. It reliably supports detected pose and gaze results at 30 frames/s via Ethernet. It also presents results that verify its application in screen control and driver monitoring systems.

Vol. 33, No. 7, pp. 1848-1857. Journal Article.

Citations: 0