{"title":"ReTern: Exploiting Natural Redundancy and Sign Transformations for Enhanced Fault Tolerance in Compute-in-Memory-Based Ternary LLMs","authors":"Akul Malhotra;Sumeet Kumar Gupta","doi":"10.1109/TVLSI.2025.3585043","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585043","url":null,"abstract":"Ternary large language models (LLMs), which use ternary precision weights and 8-bit activations, have demonstrated competitive performance while significantly reducing the high computational and memory requirements of full-precision LLMs. The energy efficiency and performance of ternary LLMs can be further improved by deploying them on ternary computing-in-memory (TCiM) accelerators, thereby alleviating the von-Neumann bottleneck. However, TCiM accelerators are prone to memory stuck-at faults (SAFs) leading to degradation in model accuracy. This is particularly severe for LLMs due to their low weight sparsity. To boost SAF tolerance of TCiM accelerators, we propose ReTern that is based on 1) fault-aware sign transformations (FASTs) and 2) TCiM bitcell reprogramming exploiting their natural redundancy. The key idea is to use FAST to minimize computation errors due to SAFs in +1/−1 weights, while the natural bitcell redundancy is exploited to target SAFs in 0 weights (zero-fix). Our experiments on BitNet b1.58 700M and 3B ternary LLMs show that our technique furnishes significant fault tolerance, notably ~35% reduction in perplexity on the Wikitext dataset in the presence of faults. These benefits come at the cost of <3%, <7%, and <1% energy, latency, and area overheads, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2518-2527"},"PeriodicalIF":3.1,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-Oriented Design and Efficient Implementation of a Geometrically Tunable Multiscroll Conservative Chaotic System Without Equilibrium Points","authors":"Yerui Guang;Qun Ding;Dongxu Liu","doi":"10.1109/TVLSI.2025.3580266","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3580266","url":null,"abstract":"Although multiscroll conservative chaotic systems exhibit rich dynamical characteristics and hold great potential for secure communications, existing designs generally suffer from limited controllability and low hardware implementation efficiency. To address these challenges, this article proposes a novel 4-D multiscroll conservative chaotic system based on a nonlinear feedback structure constructed using the floor function. This original approach simplifies the system’s logical structure, facilitating efficient hardware modeling while enabling flexible control over the number, amplitude, and spatial distribution of scrolls in 3-D space. The system’s high complexity and coexisting behaviors are validated through dynamical analyses, including equilibrium point analysis, Poincaré sections, and Lyapunov exponents (LEs). To achieve efficient deployment of the chaotic system on field-programmable gate array (FPGA) platforms, this article first simplifies the hardware implementation logic of the feedback structure through the design of an algorithmic model based on bitwise operations. Subsequently, precise control of the system’s module signals is achieved through a finite state machine (FSM) design. The results of the resource comparison analysis indicate that the proposed model achieves a high throughput of 10.08 Gbps while consuming only 1051 look-up tables (LUTs). The lower energy efficiency is 0.0264 mW/Mbps. Hardware-software co-simulation and oscilloscope visual output confirm the numerical precision and hardware feasibility of the proposed system. Finally, this system is integrated with the ZUC stream cipher to construct a novel encryption core, enabling asynchronous ciphertext transmission as well as encryption and decryption functions, thereby demonstrating its potential for secure hardware applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2528-2541"},"PeriodicalIF":3.1,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sebastian Haas;Christopher Dunkel;Friedrich Pauls;Mattis Hasler;Yogesh Verma;Nilanjana Das;Michael Raitza
{"title":"A Secure-by-Design Hardware/Operating System as a Substrate for Trustworthy Computing","authors":"Sebastian Haas;Christopher Dunkel;Friedrich Pauls;Mattis Hasler;Yogesh Verma;Nilanjana Das;Michael Raitza","doi":"10.1109/TVLSI.2025.3579484","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3579484","url":null,"abstract":"Nowadays, digital devices like sensors, cell phones, and home servers are deeply embedded in our world to make our daily lives easier. Since we heavily rely on these systems, it is crucial to guarantee their correct functionality and to ensure security and privacy properties. As systems become increasingly complex, it is difficult to maintain security since it necessitates a thorough understanding of all functionalities in hardware and software. Complexity may lead to vulnerabilities that malicious components can exploit. These components can compromise security features provided by the processing cores and the operating system (OS), jeopardizing the overall trustworthiness of the system. In this article, we provide a secure-by-default hardware/OS co-design to build a substrate for trustworthy computing in digital devices. The design is based on a tiled architecture that can integrate untrusted hardware components. Instead of relying on isolation mechanisms of potentially malicious components, isolation is achieved by dedicated and independent hardware components called trusted communication units (TCUs). By keeping the attack surface small and isolating all components by default, malicious hardware and software are restricted in access permissions and, hence, cannot easily break the system’s security. We implemented a TCU-based multiprocessor architecture in a silicon research chip, called Masur23, and ran transfer workloads and selected portions of the microkernel-based OS M<sup>3</sup>. Our measurements demonstrate the feasibility of such a hardware/OS co-design for trustworthy computing. Compared to the entire chip implementation, security features require minimal latency, area, and power consumption overhead.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 10","pages":"2862-2872"},"PeriodicalIF":3.1,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic Belief Propagation-Based Iterative Detection and Decoding for MIMO Systems","authors":"Muhao Li;Houren Ji;Xiaosi Tan;Chuan Zhang","doi":"10.1109/TVLSI.2024.3477963","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3477963","url":null,"abstract":"In this brief, a stochastic belief propagation (BP)-based iterative detection and decoding (IDD) for multiple-input and multiple-output (MIMO) system is proposed. We modify the algorithm of BP detection to make it more suitable for stochastic computation and enable the soft message to be transmitted between the detector and decoder in the format of stochastic sequences. Through IDD, the required number of iterations and quantization precision for the detector will decrease. By sharing the stochastic number generator, the hardware complexity of both the detector and decoder can be reduced. Hardware architectural optimizations and the corresponding implementation are also given, and we can implement <inline-formula> <tex-math>$64times 32$ </tex-math></inline-formula>, four-QAM MIMO system with (128, 64) polar codes with <inline-formula> <tex-math>$1.283~text {mm}^{2}$ </tex-math></inline-formula> area consumption. Compared with other detector, the hardware efficiency can be improved by 7.8 times.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2324-2328"},"PeriodicalIF":2.8,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information","authors":"","doi":"10.1109/TVLSI.2025.3579662","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3579662","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"C2-C2"},"PeriodicalIF":2.8,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11059982","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information","authors":"","doi":"10.1109/TVLSI.2025.3579664","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3579664","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"C3-C3"},"PeriodicalIF":2.8,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11059983","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144519445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Memory BIST With an Optimized RTL-BIST IP Core: A Low-Power, High-Fault-Coverage Approach","authors":"Ming-Yi Lin;Wei-Kuan Chiang;Chin-Hung Wang","doi":"10.1109/TVLSI.2025.3581296","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3581296","url":null,"abstract":"The increasing density of static random access memory (SRAM) in modern system-on-chip (SoC) architectures has intensified the need for efficient built-in self-test (BIST) solutions to ensure fault detection and repair. This article presents an optimized register transfer level (RTL)-BIST intellectual property core (IP core) that integrates a novel March mSR+ algorithm, providing a low-power, high-fault-coverage approach to embedded memory testing. Developed using high-level synthesis (HLS), the proposed framework enhances test efficiency while minimizing hardware complexity. Experimental results on field-programmable gate array (FPGA) implementations demonstrate that the March mSR+ algorithm achieves an 88.89% fault coverage while reducing power consumption compared with conventional March-based testing methods. These findings validate the effectiveness of the RTL-BIST framework in improving memory reliability for artificial intelligence (AI), high-performance computing (HPC), and safety-critical applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2556-2569"},"PeriodicalIF":3.1,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kun Li;Xiangyu Hao;Zhenguo Ma;Feng Yu;Bo Zhang;Qianjian Xing
{"title":"A Fast Floating-Point Multiply–Accumulator Optimized for Sparse Linear Algebra on FPGAs","authors":"Kun Li;Xiangyu Hao;Zhenguo Ma;Feng Yu;Bo Zhang;Qianjian Xing","doi":"10.1109/TVLSI.2025.3578619","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578619","url":null,"abstract":"This brief presents a pipelined floating-point Multiply–Accumulator (FPMAC) architecture designed to accelerate sparse linear algebra operations. By designing a lookup-table-based 5–3 carry-save adder (CSA) and combining it with a 3–2 CSA, the proposed design minimizes the critical path and boosts operational speed. Moreover, the proposed architecture takes advantage of data characteristics in sparse linear algebra to displace the shift unit in the critical accumulation loop, further increasing the throughput rate. In addition, the integration of a lookup-table-based leading-zero anticipator (LZA) enhances normalization efficiency. Experimental results show that, compared with reported FPMAC designs, the proposed architecture may achieve a significantly higher maximum clock frequency for single-precision floating-point operations.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2592-2596"},"PeriodicalIF":3.1,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingzhou Li;Fangfei Yu;Mingyuan Ma;Wei Liu;Yuhan Wang;Hualin Wu;Hu He
{"title":"RISC-V-Based GPGPU With Vector Capabilities for High-Performance Computing","authors":"Jingzhou Li;Fangfei Yu;Mingyuan Ma;Wei Liu;Yuhan Wang;Hualin Wu;Hu He","doi":"10.1109/TVLSI.2025.3574427","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3574427","url":null,"abstract":"General-purpose graphics processing units (GPGPUs) have become a leading platform for accelerating modern compute-intensive applications, such as large language models and generative artificial intelligence (AI). However, the lack of advanced open-source GPGPU microarchitectures has hindered high-performance research in this area. In this article, we present Ventus, a high-performance open-source GPGPU implementation built upon the RISC-V architecture with vector extension [RISC-V vector (RVV)]. Ventus introduces customized instructions and a comprehensive software toolchain to optimize performance. We deployed the design on a field programmable gate array (FPGA) platform consisting of 4 Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) and supporting 256 warps. Experimental results demonstrate that Ventus exhibits key performance features comparable to commercial GPGPUs, achieving an average of 83.9% instruction reduction and 87.4% cycle per instruction (CPI) improvement over the leading open-source alternatives. Under 4-, 8-, and 16-thread configurations, Ventus maintains robust instruction per cycle (IPC) performance with values of 0.47, 0.40, and 0.32, respectively. In addition, the tensor core of Ventus attains an extra average reduction of 69.1% in instruction count and a 68.4% cycle reduction ratio when running AI-related workloads. These findings highlight Ventus as a promising solution for future high-performance GPGPU research and development, offering a robust open-source alternative to proprietary solutions. Ventus can be found on <uri>https://github.com/THU-DSP-LAB/ventus-gpgpu</uri>","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2239-2251"},"PeriodicalIF":2.8,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Sample-and-Hold-Based 453-ps True Time Delay Circuit With a Wide Bandwidth of 0.5–2.5 GHz in 65-nm CMOS","authors":"Chuanjie Chen;Xiangyu Meng;Wang Xie;Baoyong Chi","doi":"10.1109/TVLSI.2025.3578959","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578959","url":null,"abstract":"Delay solutions applied to high frequencies typically involve switched transmission lines or all-pass filters. These solutions often suffer from significant insertion loss and drastic gain variations at high frequencies, along with poor delay flatness. In this work, we have designed a delay circuit that can be applied to high frequencies, featuring excellent delay flatness, good delay resolution, and a wide bandwidth. In this design, a multistage cascaded sampling circuit is used to generate delays. By introducing differential clocks or three-phase clocks, simple coarse delay or fine delay can be achieved. The measurement results show that the sample and hold circuit achieves a delay accuracy of 17.5 ps and a delay range of 453 ps within 0.5–2.5 GHz, with a gain of −2.7 to 2 dB and a gain variation of ±0.85 dB, a delay variation less than 7.5 ps, a power consumption of 111 mW, and a core area of 0.137 mm<sup>2</sup>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"2344-2348"},"PeriodicalIF":2.8,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144704918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}