Latest Articles in IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Upscale Layer Acceleration on Existing AI Hardware
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-23 DOI: 10.1109/TVLSI.2025.3558946
Vuk Vranjkovic;Predrag Teodorovic;Rastislav Struharik
Abstract: Upscaling layers are important components of modern deep learning networks but often pose computational challenges for hardware (HW) accelerators. This article addresses the issue by introducing a novel layer-replacement technique that processes upscaling layers efficiently using existing hardware-supported operations such as depthwise convolutions and maximum pooling. To minimize the number of replacement layers, an efficient layer-count reduction algorithm is proposed. Experimental results on four deep neural networks demonstrate significant speedups, ranging from 1.58× to 32.88× over the original HW/software (SW) execution approach and from 3.65× to 19.21× over the software-only solution, with minimal hardware overhead (0.068% more field-programmable gate array (FPGA) look-up tables (LUTs)). Notably, the technique introduces no numerical errors and maintains input-data processing quality comparable to that of the original network.
Vol. 33, no. 6, pp. 1624-1637.
Citations: 0
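The abstract does not give the authors' exact replacement rules, but the kind of mapping it relies on is easy to illustrate. A hedged NumPy sketch (function names hypothetical, not the paper's implementation): nearest-neighbor upscaling, one common upscale layer, is mathematically identical to a stride-s transposed depthwise convolution with an all-ones s×s kernel, an operation shape many accelerators already support.

```python
import numpy as np

def upscale_nearest(x, s=2):
    # Reference nearest-neighbor upscale: copy each pixel into an s-by-s block.
    return x.repeat(s, axis=0).repeat(s, axis=1)

def upscale_as_stamped_kernel(x, s=2):
    # The same result expressed as a stride-s transposed depthwise
    # convolution with an all-ones s-by-s kernel: every input pixel
    # "stamps" the kernel onto a zero-initialized output.
    h, w = x.shape
    kernel = np.ones((s, s), dtype=x.dtype)
    out = np.zeros((h * s, w * s), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i * s:(i + 1) * s, j * s:(j + 1) * s] += x[i, j] * kernel
    return out

x = np.arange(6, dtype=np.float32).reshape(2, 3)
assert np.array_equal(upscale_nearest(x), upscale_as_stamped_kernel(x))
```

Because the two formulations are exactly equal, such a replacement introduces no numerical error, consistent with the abstract's claim.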
DMSA: An Efficient Architecture for Sparse–Sparse Matrix Multiplication Based on Distribute-Merge Product Dataflow
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-23 DOI: 10.1109/TVLSI.2025.3558895
Yuta Nagahara;Jiale Yan;Kazushi Kawamura;Daichi Fujiki;Masato Motomura;Thiem Van Chu
Abstract: Sparse–sparse matrix multiplication (SpMSpM) is a fundamental operation in various applications. Existing SpMSpM accelerators based on the inner product (IP) and outer product (OP) suffer from low computational efficiency and high memory traffic due to inefficient index matching and merging overheads. Gustavson's product (GP)-based accelerators mitigate some of these challenges but struggle with workload imbalance and irregular memory-access patterns, limiting computational parallelism. To overcome these limitations, the authors propose the distribute-merge product (DMP), a novel SpMSpM dataflow that evenly distributes workloads across multiple computation streams and merges partial results efficiently. The DMP-based SpMSpM architecture (DMSA) incorporates four key techniques to fully exploit the parallelism of DMP and to handle irregular memory accesses efficiently. Implemented on a Xilinx ZCU106 FPGA, DMSA achieves speedups of up to 3.38× and 1.73× over two state-of-the-art FPGA-based SpMSpM accelerators while maintaining comparable hardware resource usage. In addition, compared to CPU and GPU implementations on an NVIDIA Jetson AGX Xavier, DMSA is 4.96× and 1.53× faster while achieving 6.67× and 2.33× better energy efficiency, respectively.
Vol. 33, no. 7, pp. 1858-1871.
Citations: 0
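The DMP dataflow itself is not specified in the abstract, but the Gustavson's product (GP) baseline it refines is a standard row-wise SpGEMM scheme. A minimal Python sketch, using a dict-of-dicts sparse format purely for illustration:

```python
def spgemm_gustavson(A, B):
    # Gustavson's product (row-wise SpGEMM): for each nonzero A[i][k],
    # scale sparse row k of B and merge the partial products into a
    # sparse accumulator for output row i.
    C = {}
    for i, row_a in A.items():
        acc = {}
        for k, a_ik in row_a.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {2: 6}}
assert spgemm_gustavson(A, B) == {0: {1: 4, 2: 12}, 1: {0: 15}}
```

The per-row accumulator is where GP designs hit the workload imbalance and irregular accesses the abstract mentions: rows of A with many nonzeros dominate, which is what DMP's distribute-then-merge split addresses.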
An 197-μJ/Frame Single-Frame Bundle Adjustment Hardware Accelerator for Mobile Visual Odometry
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-22 DOI: 10.1109/TVLSI.2025.3557872
Cheng Nian;Xiaorui Mo;Weiyi Zhang;Fasih Ud Din Farrukh;Yushi Guo;Fei Chen;Chun Zhang
Abstract: This article presents an energy-efficient hardware accelerator for optimized bundle adjustment (BA) for mobile high-frame-rate visual odometry (VO). BA uses graph-optimization techniques to optimize poses and landmarks, with applications in robot navigation, virtual reality (VR), and augmented reality (AR). Existing software implementations of BA involve complex computational flows, numerical calculations, and Lie group/Lie algebra conversions, which leads to slow computation and high power consumption. A two-level-reuse hardware architecture is proposed and implemented that efficiently updates the Jacobian matrix while reducing field-programmable gate array (FPGA) hardware resources by 25%. A set of methodologies is proposed to quantify the errors introduced by fixed-point arithmetic during optimization. A fully pipelined architecture increases computational speed while reducing hardware resources by 29%, and features a parallel equation solver that improves processing speed by 2× over conventional approaches. Using single-frame local-BA VO on the KITTI and EuRoC datasets, the design achieves an average translational error of 0.75% and a rotational error of 0.0028°/m. The proposed hardware sustains 188-345 frames/s when optimizing two main feature-extraction methods with a maximum of 512 extracted feature points. Compared to state-of-the-art implementations, the accelerator achieves a minimum energy of 11.6 mJ on the FPGA platform and 191 μJ on the application-specific integrated circuit (ASIC) platform. These improvements underscore the potential of FPGAs to enhance VO systems' adaptability and efficiency in complex environments.
Vol. 33, no. 7, pp. 1872-1885.
Citations: 0
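The "parallel equation solver" targets the normal-equation solve at the core of every BA iteration. A hedged sketch of that Gauss-Newton core on a toy 1-D fitting problem (the paper's formulation works on poses and landmarks; names here are illustrative):

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=10):
    # Each iteration solves the normal equations (J^T J) dx = -J^T r,
    # the dense linear system a BA backend must solve repeatedly.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)       # stacked residual vector
        J = jacobian(x)       # Jacobian of residuals w.r.t. parameters
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x

# Toy problem: fit y = a*t + b to exact data. Linear, so Gauss-Newton
# recovers (a, b) = (2, 1) in a single step.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * t + 1.0
res = lambda p: p[0] * t + p[1] - y
jac = lambda p: np.stack([t, np.ones_like(t)], axis=1)
p = gauss_newton(res, jac, [0.0, 0.0])
```

In real BA the residuals are reprojection errors and the Jacobian update is the expensive part, which is why the paper's two-level-reuse Jacobian architecture and fixed-point error analysis matter.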
A Switched-Based Slew Rate and Gain Boosting Parallel-Path Amplifier for Switched-Capacitor Applications
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-22 DOI: 10.1109/TVLSI.2025.3557467
Javad Bagheri Asli;Alireza Saberkari;Atila Alvandpour
Abstract: A parallel-path amplifier (PPA) incorporating a switched-based slew-rate and gain-boosting stage as a feed-forward path, in parallel with a linear amplifier, is introduced in this brief as an alternative to conventional analog amplifiers; it achieves high accuracy through the linear path and high slewing through the assisting feed-forward path. The feed-forward path employs a preamplifier, a hysteresis detector, and differential charge pumps, while the linear path includes a recycling folded-cascode amplifier. An analysis studies the amplifier's settling error with and without the feed-forward path, as well as the trade-off between the dead-zone width of the hysteresis detector and the amplifier's settling speed. The feed-forward path improves the slew rate by 2.5× (800 V/μs), the effective GBW by 15%, and the dc gain by 16 dB, at the cost of 187.5 μA of extra current consumption and 1.25 μm² of extra silicon area. As a proof of concept, the proposed amplifier is used as the multiplying digital-to-analog converter (MDAC) amplifier of an 8-bit pipeline analog-to-digital converter (ADC), and the ADC is fabricated in a 65-nm CMOS process. The results reveal that the spurious-free dynamic range (SFDR) and signal-to-noise-and-distortion ratio (SNDR) are improved by 6-7 dB in the presence of the feed-forward path.
Vol. 33, no. 6, pp. 1799-1802.
Citations: 0
Enhancing Wireless PHY With Adaptive OFDM and Multiarmed Bandit Learning on Zynq System-on-Chip
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-14 DOI: 10.1109/TVLSI.2025.3528865
Neelam Singh;Sumit J. Darak
Abstract: This work presents an intelligent and reconfigurable wireless physical layer (PHY) that dynamically adjusts transmission parameters for a given radio-frequency (RF) environment. The proposed PHY is based on orthogonal frequency-division multiplexing (OFDM) and can dynamically augment OFDM with a finite impulse response (FIR) low-pass filter to improve out-of-band emissions (OOBE). To make these adaptations intelligently, multiarmed bandit (MAB)-based online learning algorithms are employed, specifically the upper confidence bound with control variate (UCB-CV). UCB-CV enhances traditional UCB by leveraging side information such as interference level and transmit power through a control-variate approach, incorporating the coefficient of variation (CV) into reward estimation to manage interference more effectively. These algorithms are integrated into the PHY of an FPGA-based OFDM transceiver on the Zynq system-on-chip (SoC), enabling real-time decision-making based on side-channel interference and other parameters. A comparative analysis shows that UCB-CV outperforms traditional UCB in reducing the bit-error rate (BER) and managing interference. The work also highlights the advantages of filtered OFDM (FOFDM) over standard OFDM: FOFDM significantly reduces OOBE by 20-75 dBW/Hz and improves BER. In environments with high interference, UCB-CV achieves a 29.54% throughput improvement over UCB.
Vol. 33, no. 6, pp. 1651-1664.
Citations: 0
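For context, the classic UCB1 rule that UCB-CV extends is compact enough to sketch. Plain UCB1 only, with hypothetical Bernoulli "arms" standing in for PHY configurations; the paper's control-variate reward correction is not reproduced here:

```python
import math
import random

def ucb1_select(counts, means, t):
    # UCB1: play each arm once, then pick the arm maximizing
    # empirical mean + sqrt(2 ln t / n_a) exploration bonus.
    for a in range(len(counts)):
        if counts[a] == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))

def run(arm_probs, horizon=5000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    means = [0.0] * len(arm_probs)
    for t in range(1, horizon + 1):
        a = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < arm_probs[a] else 0.0
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]  # running mean update
    return counts

counts = run([0.2, 0.5, 0.8])  # three hypothetical PHY configurations
```

Over the horizon, pulls concentrate on the best arm; UCB-CV's contribution is using measured side information to shrink the variance of the reward estimates, so this concentration happens faster under interference.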
Implementing Homomorphic Encryption-Based Logic Locking in System-On-Chip Designs
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-10 DOI: 10.1109/TVLSI.2025.3556241
Ziyang Ye;Makoto Ikeda
Abstract: This study presents a logic-locking scheme based on the binary ring learning with errors (bin-RLWE) algorithm, implemented in a RISC-V system-on-chip (SoC) design. Unlike traditional logic-locking methods that provide users with raw locking parameters, the proposed approach secures critical logic paths in the privilege-switching process without exposing these sensitive parameters. The locking module itself consumes 3519 lookup tables (LUTs) and 2645 registers, an overall overhead of 6.0% in LUTs and 6.9% in registers over the baseline system. The unlock process takes about 2.6 μs, a moderate performance impact that primarily affects system-level operations while preserving user-level computational efficiency.
Vol. 33, no. 7, pp. 2049-2053.
Citations: 0
Real-Time Driver Monitoring: Implementing FPGA-Accelerated CNNs for Pose Detection
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-10 DOI: 10.1109/TVLSI.2025.3554880
Minjoon Kim;Jaehyuk So
Abstract: As autonomous driving technology advances at an unprecedented pace, drivers are gaining greater freedom within their vehicles, which accelerates the development of intelligent systems that support safe and efficient driving. These systems provide interactive applications between vehicle and driver based on driver behavior analysis (DBA). A key performance indicator is real-time driver-monitoring quality, which directly affects both safety and convenience in vehicle operation. Real-time interaction generally requires an image-processing rate above 30 frames/s and a latency below 100 ms, yet meeting these targets in software often demands expensive devices. This article therefore presents an algorithm, and its implementation results, for immediate in-vehicle DBA through field-programmable gate array (FPGA)-based high-speed upper-body pose estimation. First, 11 key points related to the driver's pose and gaze are defined, and a convolutional neural network (CNN) architecture that detects them quickly is modeled. The proposed algorithm uses regeneration and retraining through layer reduction based on a residual-CNN model. Implemented at the register-transfer level (RTL) on a VCU118 FPGA, it demonstrates simulated performance of 34.7 frames/s at a latency of 75.3 ms. Lastly, the article discusses a demo application and a vehicle testbed built to experiment with the driver-vehicle interaction (DVI) system. The developed FPGA platform processes camera input in real time, reliably delivering detected pose and gaze results at 30 frames/s via Ethernet, and is verified in screen-control and driver-monitoring applications.
Vol. 33, no. 7, pp. 1848-1857.
Citations: 0
Xpikeformer: Hybrid Analog-Digital Hardware Acceleration for Spiking Transformers
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-08 DOI: 10.1109/TVLSI.2025.3552534
Zihang Song;Prabodh Katti;Osvaldo Simeone;Bipin Rajendran
Abstract: The integration of neuromorphic computing and transformers through spiking neural networks (SNNs) offers a promising path to energy-efficient sequence modeling, with the potential to overcome the energy-intensive nature of artificial neural network (ANN)-based transformers. However, the algorithmic efficiency of SNN-based transformers cannot be fully exploited on GPUs due to architectural incompatibility. This article introduces Xpikeformer, a hybrid analog-digital hardware architecture designed to accelerate SNN-based transformer models. The architecture integrates analog in-memory computing (AIMC) for feedforward and fully connected layers with a stochastic spiking attention (SSA) engine for efficient attention mechanisms. The design, implementation, and evaluation of Xpikeformer demonstrate significant improvements in energy consumption and computational efficiency. On image-classification and wireless-communication symbol-detection tasks, Xpikeformer achieves inference accuracy comparable to GPU implementations of ANN-based transformers. Evaluations reveal a 13× reduction in energy consumption at approximately the same throughput as the state-of-the-art (SOTA) digital accelerator for ANN-based transformers, and up to 1.9× energy reduction compared to the optimal digital ASIC projection of SOTA SNN-based transformers.
Vol. 33, no. 6, pp. 1596-1609.
Citations: 0
Flex-PE: Flexible and SIMD Multiprecision Processing Element for AI Workloads
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-07 DOI: 10.1109/TVLSI.2025.3553069
Mukul Lokhande;Gopal Raut;Santosh Kumar Vishvakarma
Abstract: The rapid evolution of artificial intelligence (AI) models, from deep neural networks (DNNs) to transformers and large language models (LLMs), demands flexible hardware that meets diverse execution needs across edge and cloud platforms. Existing accelerators lack unified support for multiprecision arithmetic and runtime-configurable activation functions (AFs). This work proposes Flex-PE, a single instruction, multiple data (SIMD)-enabled multiprecision processing element that efficiently integrates multiply-and-accumulate operations with configurable AFs (Sigmoid, Tanh, ReLU, and SoftMax) in unified hardware. The design achieves throughput improvements of up to 16× in FxP4, 8× in FxP8, 4× in FxP16, and 1× in FxP32 modes, with maximum hardware efficiency for both iterative and pipelined architectures. An area-efficient iterative Flex-PE-based SIMD systolic array reduces DMA reads by up to 62× for input feature maps and 371× for weight filters in VGG-16, achieving 8.42 GOPS/W energy efficiency with minimal accuracy loss (<2%). Flex-PE scales from 4-bit edge inference to FxP8/16/32 for edge and cloud high-performance computing (HPC), providing adaptable AI hardware with balanced precision, throughput, and energy efficiency.
Vol. 33, no. 6, pp. 1610-1623.
Citations: 0
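The FxP4/8/16/32 modes refer to fixed-point operand widths. A hedged Python sketch of the quantize-then-MAC arithmetic such a PE time-multiplexes (bit widths and rounding policy are illustrative, not Flex-PE's datapath):

```python
def quantize(x, n_bits, frac_bits):
    # Round to the nearest representable fixed-point value and
    # saturate to the signed n-bit integer range.
    q = round(x * (1 << frac_bits))
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return max(lo, min(hi, q))

def fxp_mac(xs, ws, n_bits=8, frac_bits=4):
    # Multiply-accumulate on quantized operands; the integer accumulator
    # carries 2*frac_bits fractional bits and is rescaled at the end.
    acc = 0
    for x, w in zip(xs, ws):
        acc += quantize(x, n_bits, frac_bits) * quantize(w, n_bits, frac_bits)
    return acc / (1 << (2 * frac_bits))

# 0.5*0.5 + 1.0*0.25 = 0.5, exactly representable at 4 fractional bits.
result = fxp_mac([0.5, 1.0], [0.5, 0.25], n_bits=8, frac_bits=4)
```

Halving the operand width halves multiplier area per lane, which is why a fixed-width SIMD datapath can pack 4× as many FxP4 MACs as FxP16 ones, matching the 16×/8×/4×/1× throughput ladder in the abstract.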
A 28-nm Cascode Current Mirror-Based Inconsistency-Free Charging-and-Discharging SRAM-CIM Macro for High-Efficient Convolutional Neural Networks
IF 2.8, CAS Zone 2, Engineering & Technology
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2025-04-07 DOI: 10.1109/TVLSI.2025.3552641
Chunyu Peng;Jiating Guo;Shengyuan Yan;Yiming Wei;Xiaohang Chen;Wenjuan Lu;Chenghu Dai;Zhiting Lin;Xiulong Wu
Abstract: Computing-in-memory (CIM) is an emerging approach to alleviating the von Neumann bottleneck and enhancing energy efficiency and throughput. This brief introduces a 16-Kb static random access memory (SRAM) CIM macro for convolutional neural networks (CNNs), featuring cascode current mirror-based inconsistency-free computing circuits (CICCs). The bias voltage of the CICC is provided by a cascode current mirror (CCM) circuit. The proposed architecture improves the consistency and linearity of bitline (BL) charge and discharge rates in the analog current domain, enhancing computational accuracy. Additionally, charge and discharge on the BLs directly represent positive and negative calculation results, eliminating the extra encoding and logic circuits otherwise needed to handle sign bits. Fabricated in a 28-nm CMOS technology, the SRAM-CIM macro achieves an energy efficiency of 59.1-134.0 TOPS/W and a throughput of 0.41 TOPS; estimated inference accuracy on the MNIST and CIFAR-10 datasets is 96.5% and 91.4%, respectively, with 5-bit input precision and 1-bit weight precision.
Vol. 33, no. 7, pp. 2044-2048.
Citations: 0