{"title":"RDDM: A Rate-Distortion Guided Diffusion Model for Learned Image Compression Enhancement","authors":"Sanxin Jiang;Jiro Katto;Heming Sun","doi":"10.1109/JETCAS.2025.3563228","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3563228","url":null,"abstract":"Currently, denoising diffusion probability models (DDPM) have achieved significant success in various image generation tasks, but their application in image compression, especially in the context of learned image compression (LIC), is quite limited. In this study, we introduce a rate-distortion (RD) guided diffusion model, referred to as RDDM, to enhance the performance of LIC. In RDDM, LIC is treated as a lossy codec function constrained by RD, dividing the input image into two parts through encoding and decoding operations: the reconstructed image and the residual image. The construction of RDDM is primarily based on two points. First, RDDM treats diffusion models as repositories of image structures and textures, built using extensive real-world datasets. Under the guidance of RD constraints, it extracts and utilizes the necessary structural and textural priors from these repositories to restore the input image. Second, RDDM employs a Bayesian network to progressively infer the input image based on the reconstructed image and its codec function. Additionally, our research reveals that RDDM’s performance declines when its codec function does not match the reconstructed image. However, using the highest bitrate codec function minimizes this performance drop. The resulting model is referred to as <inline-formula> <tex-math>$text{RDDM}^{star }$ </tex-math></inline-formula>. The experimental results indicate that both RDDM and <inline-formula> <tex-math>$text{RDDM}^{star }$ </tex-math></inline-formula> can be applied to various architectures of LICs, such as CNN, Transformer, and their hybrid. They can significantly improve the fidelity of these codecs while maintaining or even enhancing perceptual quality to some extent.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"186-199"},"PeriodicalIF":3.7,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU","authors":"Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti","doi":"10.1109/JETCAS.2025.3562734","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3562734","url":null,"abstract":"Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need of sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a <inline-formula> <tex-math>$24times 8$ </tex-math></inline-formula> systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (<inline-formula> <tex-math>$121times $ </tex-math></inline-formula> speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm<sup>2</sup>, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to <inline-formula> <tex-math>$10.8times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.11times $ </tex-math></inline-formula>, respectively, while reducing their energy consumption by up to <inline-formula> <tex-math>$10.8times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.29times $ </tex-math></inline-formula>. These enhancements translate into a <inline-formula> <tex-math>$1.58times $ </tex-math></inline-formula> increase in throughput (310 GOPS at 0.8 V) and a <inline-formula> <tex-math>$1.42times $ </tex-math></inline-formula> improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"200-216"},"PeriodicalIF":3.7,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Two-Range Quantization and Hardware Co-Design for Large Language Model Acceleration","authors":"Siqi Cai;Gang Wang;Wenjie Li;Dongxu Lyu;Guanghui He","doi":"10.1109/JETCAS.2025.3562937","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3562937","url":null,"abstract":"Large language models (LLMs) face high computational and memory demands. While prior studies have leveraged quantization to reduce memory requirements, critical challenges persist: unaligned memory accesses, significant quantization errors when handling outliers that span larger quantization ranges, and the increased hardware overhead associated with processing high-bit-width outliers. To address these issues, we propose a quantization algorithm and hardware architecture co-design for efficient LLM acceleration. Algorithmically, a grouped adaptive two-range quantization (ATRQ) with an in-group embedded identifier is proposed to encode outliers and normal values in distinct ranges, achieving hardware-friendly aligned memory access and reducing quantization errors. From a hardware perspective, we develop a low-overhead ATRQ decoder and an outlier-bit-split processing element (PE) to reduce the hardware overhead associated with high-bit-width outliers, effectively leveraging their inherent sparsity. To support mixed-precision computation and accommodate diverse dataflows during the prefilling and decoding phases, we design a reconfigurable local accumulator that mitigates the overhead associated with additional adders. Experimental results show that the ATRQ-based accelerator outperforms existing solutions, achieving up to <inline-formula> <tex-math>$2.48times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$2.01times $ </tex-math></inline-formula> energy reduction in LLM prefilling phase, and <inline-formula> <tex-math>$1.87times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$2.03times $ </tex-math></inline-formula> energy reduction in the decoding phase, with superior model performance under post-training quantization.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"272-284"},"PeriodicalIF":3.7,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Overview of Neural Rendering Accelerators: Challenges, Trends, and Future Directions","authors":"Junha Ryu;Hoi-Jun Yoo","doi":"10.1109/JETCAS.2025.3561777","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3561777","url":null,"abstract":"Rapid advancements in neural rendering have revolutionized the fields of augmented reality (AR) and virtual reality (VR) by enabling photorealistic 3D modeling and rendering. However, deploying neural rendering on edge devices presents significant challenges due to computational complexity, memory inefficiencies, and energy constraints. This paper provides a comprehensive overview of neural rendering accelerators, identifying the major hardware inefficiencies across sampling, positional encoding, and multi-layer perception (MLP) stages. We explore hardware-software co-optimization techniques that address these challenges and provide a summary for in-depth analysis. Additionally, emerging trends like 3D Gaussian Splatting (3DGS) and hybrid rendering approaches are briefly introduced, highlighting their potential to improve rendering quality and efficiency. By presenting a unified analysis of challenges, solutions, and future directions, this work aims to guide the development of next-generation neural rendering accelerators, especially for resource-constrained environments.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"299-311"},"PeriodicalIF":3.7,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10967345","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LightRot: A Light-Weighted Rotation Scheme and Architecture for Accurate Low-Bit Large Language Model Inference","authors":"Sangjin Kim;Yuseon Choi;Jungjun Oh;Byeongcheol Kim;Hoi-Jun Yoo","doi":"10.1109/JETCAS.2025.3558300","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3558300","url":null,"abstract":"As large language models (LLMs) continue to demonstrate exceptional capabilities across various domains, the challenge of achieving energy-efficient and accurate inference becomes increasingly critical. This work presents LightRot, a lightweight rotation scheme and dedicated hardware accelerator designed for low-bit LLM inference. The proposed architecture integrates Grouped Local Rotation (GLR) and Outlier Direction Aligning (ODA) algorithms with a hierarchical Fast Hadamard Transform (FHT)-based rotation unit to address key challenges in low-bit quantization, including the energy overhead of rotation operations. The proposed accelerator, implemented in a 28nm CMOS process, achieves a peak energy efficiency of 27.4TOPS/W for 4-bit inference, surpassing prior state-of-the-art designs. Unlike conventional approaches that rely on higher-precision inference or evaluate on basic language modeling tasks like GPT-2, LightRot is optimized for advanced models such as LLaMA2-13B and LLaMA3-8B. Its performance is further validated on MT-Bench, demonstrating robust applicability to real-world conversational scenarios and redefining benchmarks for chat-based AI systems. By synergizing algorithmic innovations and hardware efficiency, this work sets a new paradigm for scalable, low-bit LLM inference, paving the way for sustainable AI advancements.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"231-243"},"PeriodicalIF":3.7,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Hardware Architecture Design for Rotary Position Embedding of Large Language Models","authors":"Wenjie Li;Gang Wang;Dongxu Lyu;Ningyi Xu;Guanghui He","doi":"10.1109/JETCAS.2025.3556443","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3556443","url":null,"abstract":"Due to the substantial demands of storage and computation imposed by large language models (LLMs), there has been a surge of research interest in their hardware acceleration. As a technique involving non-linear operations, rotary position embedding (RoPE) has been adopted by some recently released LLMs. However, there is currently no reported research on its hardware design. This paper, for the first time, presents an efficient hardware architecture design for RoPE of LLMs. We first explore the similarities between RoPE and the coordinate rotation digital computer (CORDIC) algorithm, while also considering the commonly used quantization scheme for LLMs. Additionally, we propose a hardware-friendly solution to address the issue of excessively large input angle ranges. Then we present a CORDIC-based approximation for RoPE and develop a hardware architecture for it. The experimental results demonstrate that our design can save up to 45.7% area cost and 31.0% power consumption when compared with the fixed-point counterpart, while maintaining almost the same model performance. Compared to the straightforward implementation using floating-point arithmetic, our design can reduce up to 91.4% area cost and 88.9% power consumption, with negligible performance loss.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"244-257"},"PeriodicalIF":3.7,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"F3: An FPGA-Based Transformer Fine-Tuning Accelerator With Flexible Floating Point Format","authors":"Zerong He;Xi Jin;Zhongguang Xu","doi":"10.1109/JETCAS.2025.3555970","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3555970","url":null,"abstract":"Transformers have demonstrated remarkable success across various deep learning tasks. However, their inference and fine-tuning require substantial computation and memory resources, posing challenges for existing hardware platforms, particularly resource-constrained edge devices. To address these limitations, we propose F<sup>3</sup>, an FPGA-based accelerator for transformer fine-tuning. To reduce computation and memory overhead, this paper proposes a flexible floating point (FFP) format which consumes fewer resources than traditional floating-point formats of the same bitwidth. We adapt low-rank adaptation to FFP format and propose a fine-tuning strategy named LR-FFP which reduces the number of trainable parameters without compromising fine-tuning accuracy. At the hardware level, we design specialized processing elements (PEs) for the FFP format. The PE maximizes the utilization of DSP resources, enabling a single DSP to perform two multiply-accumulate operations per cycle. The PEs are organized into a systolic array (SA) to efficiently handle general matrix multiplication during fine-tuning. Through theoretical analysis and experimental evaluation, we determine the optimal dataflow and SA parameters to balance performance and resource consumption. We implement the architecture on the Xilinx VCU128 FPGA platform and F<sup>3</sup> achieves a performance of 8.2 TFlops at 250 MHz. Compared with CPU and GPU implementations, F<sup>3</sup> achieves speedups of <inline-formula> <tex-math>$15.22 times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$3.44 times $ </tex-math></inline-formula>, respectively, and energy efficiency improvements of <inline-formula> <tex-math>$70.52 times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$9.44 times $ </tex-math></inline-formula>.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"258-271"},"PeriodicalIF":3.7,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Die-Level Transformation From 2D Shuttle Chips to 3D-IC With TSV for Advanced Rapid Prototyping Methodology With Meta Bonding","authors":"Takafumi Fukushima;Tetsu Tanaka;Mitsumasa Koyanagi","doi":"10.1109/JETCAS.2025.3572003","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3572003","url":null,"abstract":"3D-IC technology, it may be more appropriate to refer to this as TSV (Through-Si Via) formation technology, has been maturing year by year and is increasingly utilized in advanced semiconductor devices, such as 3D CIS (CMOS Image Sensor), HBM (High-Bandwidth Memory), and SRAM-on-CPU (named 3D V-Cache) devices. However, the initial development costs remain prohibitively high, largely due to the substantial investment required for TSV formation at the wafer level. Meanwhile, conventional System on a Chips (SoCs) are transitioning from Fin-FET to GAA (Gate All Around) using the latest beyond 3-nm technology nodes, incorporating extreme ultraviolet (EUV) and other cutting-edge techniques. Meanwhile, the academic community is establishing an environment conducive to the utilization of nodes ranging from legacy 180 nm to 7 nm, making it feasible for designers to obtain 2D IC chips with their novel architectures at a reduced cost. Despite these advancements, foundry shuttle services employing TSV are still almost impossible to utilize, and performing proof of principle and functional verification using 3D-ICs remains extremely challenging. This article introduces recent advancements in technology that can transform 2D-ICs into 3D-ICs using shuttle chips for Multi-Project Wafers (MPWs) at a small scale to a large scale. This article mainly focuses on discussing the facilitation of die-level short-TAT (turnaround time) 3D-IC fabrication with key elemental technologies of multi-chip thinning and TSV/microbump formation. In addition, the effectiveness of Meta Bonding, such as fine-pitch microbump and direct/hybrid bonding, is described for future high-performance 3D-IC prototyping.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 3","pages":"415-426"},"PeriodicalIF":3.8,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11007580","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145060958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GenPolar: Generative AI-Aided Complexity Reduction for Polar SCL Decoding","authors":"Yutai Sun;Jingyi Chen;Yuqing Ren;Houren Ji;Yongming Huang;Xiaohu You;Chuan Zhang","doi":"10.1109/JETCAS.2025.3561330","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3561330","url":null,"abstract":"The CRC-aided successive cancellation list (CA-SCL) decoding algorithm for polar codes has gained widespread adoption thanks to its outstanding performance. However, with the evolution of 6G technologies, the high complexity of CA-SCL decoding poses a challenge in meeting growing performance requirements. Consequently, it is crucial to devise strategies that reduce this complexity without compromising error rates. Current efforts to mitigate the complexity mainly depend on harnessing <monospace>special nodes</monospace> associated with the code construction sequences, such as Fast-SCL decoding. However, these strategies suffer from redundant complexity due to ill-suited construction sequences and unnecessary sorting operations within special nodes. Addressing this issue, this paper proposes a hardware-friendly and GenAI-aided complexity reduction approach for Fast-SCL decoding, named GenPolar. This approach involves two-step optimization techniques: 1) <italic>Transformer encoder models</i> for generating polar construction sequences, and 2) <italic>a sorting entropy based method</i> for sorting reduction. These two-step techniques result in reduced complexity with negligible performance loss. For polar codes of length-1024 with code rates of 0.25, 0.50, and 0.75, GenPolar achieves latency reductions of 20.6%, 29.8%, and 40.6%, respectively. Even benchmarking against the reduced-complexity version of Fast-SCL decoding, the relative gains are 14.0%, 17.8%, and 22.3%, respectively. It should be noted that the immediate application is not limited to Fast-SCL decoding but also extends to other node-based SCL decoding algorithms like SSCL-SPC and SR-SCL.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"312-324"},"PeriodicalIF":3.7,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11007206","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial on Circuits and Systems for Green Video Communications","authors":"Christian Herglotz;Daniel Palomino;Olivier Le Meur;C.-C. Jay Kuo","doi":"10.1109/JETCAS.2025.3541767","DOIUrl":"https://doi.org/10.1109/JETCAS.2025.3541767","url":null,"abstract":"","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 1","pages":"1-3"},"PeriodicalIF":3.7,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10924431","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}