{"title":"Identifying Optimal Workload Offloading Partitions for CPU-PIM Graph Processing Accelerators","authors":"Sheng Xu;Chun Li;Le Luo;Wu Zhou;Liang Yan;Xiaoming Chen","doi":"10.1109/TVLSI.2025.3526201","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3526201","url":null,"abstract":"The integrated architecture that features both in-memory logic and host processors, or so-called “processing-in-memory” (PIM) architecture, is an emerging and promising solution to bridge the performance gap between the memory and host processors. In spite of the considerable potential of PIM, the workload offloading policy, which partitions the program and determines where code snippets are executed, is still a main challenge in PIM. In order to determine the best PIM offloading partitions, existing methods require in-depth program profiling to create the control flow graph (CFG) and then transform it into a graph-cut problem. These CFG-based solutions depend on detailed profiling of a crucial element, the execution time of basic blocks, to accurately assess the benefits of PIM offloading. The issue is that these execution times can change significantly in PIM, leading to inaccurate offloading decisions. To tackle this challenge, we present a novel PIM workload offloading framework called “RDPIM” for CPU-PIM graph processing accelerators, which systematically considers the variations in the execution time of basic blocks. By analyzing the relationship between data dependencies among workloads and the connectivity of input graphs, we identified three key features that can lead to variations in execution time. We developed a novel reuse distance (RD)-based model to predict the exact performance of basic blocks for optimal offloading decisions. We evaluate RDPIM using real-world graphs and compare it with some state-of-the-art PIM offloading approaches. Experiments have demonstrated that our method achieves an average speedup of <inline-formula> <tex-math>$2times $ </tex-math></inline-formula> compared to CPU-only executions and up to <inline-formula> <tex-math>$1.6times $ </tex-math></inline-formula> compared to state-of-the-art PIM offloading schemes.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"1053-1064"},"PeriodicalIF":2.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143675672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M2-ViT: Accelerating Hybrid Vision Transformers With Two-Level Mixed Quantization","authors":"Yanbiao Liang;Huihong Shi;Zhongfeng Wang","doi":"10.1109/TVLSI.2024.3525184","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3525184","url":null,"abstract":"Although vision transformers (ViTs) have achieved significant success, their intensive computations and substantial memory overheads challenge their deployment on edge devices. To address this, efficient ViTs have emerged, typically featuring convolution-transformer hybrid architectures to enhance both accuracy and hardware efficiency. While prior work has explored quantization for efficient ViTs to marry the hardware efficiency of efficient hybrid ViT architectures and quantization, it focuses on uniform quantization and overlooks the potential advantages of mixed quantization. Meanwhile, although several works have studied mixed quantization for standard ViTs, they are not directly applicable to hybrid ViTs due to their distinct algorithmic and hardware characteristics. To bridge this gap, we present M2-ViT to accelerate convolution-transformer hybrid efficient ViTs with two-level mixed quantization (M2Q). Specifically, we introduce a hardware-friendly M2Q strategy, characterized by both mixed quantization precision and mixed quantization schemes [uniform and power-of-two (PoT)], to exploit the architectural properties of efficient ViTs. We further build a dedicated accelerator with heterogeneous computing engines to transform algorithmic benefits into real hardware improvements. The experimental results validate our effectiveness, showcasing an average of 80% energy-delay product (EDP) saving with comparable quantization accuracy compared to the prior work. Codes are available at <uri>https://github.com/lybbill/M2ViT</uri>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1492-1496"},"PeriodicalIF":2.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPICED+: Syntactical Bug Pattern Identification and Correction of Trojans in A/MS Circuits Using LLM-Enhanced Detection","authors":"Jayeeta Chaudhuri;Dhruv Thapar;Arjun Chaudhuri;Farshad Firouzi;Krishnendu Chakrabarty","doi":"10.1109/TVLSI.2025.3527382","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3527382","url":null,"abstract":"Analog and mixed-signal (A/MS) integrated circuits (ICs) are crucial in modern electronics, playing key roles in signal processing, amplification, sensing, and power management. Many IC companies outsource manufacturing to third-party foundries, creating security risks such as syntactical bugs and stealthy analog Trojans. Traditional Trojan detection methods, including embedding circuit watermarks and hardware-based monitoring, impose significant area and power overheads while failing to effectively identify and localize the Trojans. To overcome these shortcomings, we present SPICED+, a software-based framework designed for syntactical bug pattern identification and the correction of Trojans in A/MS circuits, leveraging large language model (LLM)-enhanced detection. It uses LLM-aided techniques to detect, localize, and iteratively correct analog Trojans in SPICE netlists, without requiring explicit model training, and thus incurs zero area overhead. The framework leverages chain-of-thought reasoning and few-shot learning to guide the LLMs in understanding and applying anomaly detection rules, enabling accurate identification and correction of Trojan-impacted nodes. With the proposed method, we achieve an average Trojan coverage of 93.3%, average Trojan correction rate of 91.2%, and an average false-positive rate of 1.4%.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"1118-1131"},"PeriodicalIF":2.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143676129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SysCIM: A Heterogeneous Chip Architecture for High-Efficiency CNN Training at Edge","authors":"Shuai Wang;Ziwei Li;Yuang Ma;Yi Kang","doi":"10.1109/TVLSI.2025.3526363","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3526363","url":null,"abstract":"Neural network training is notoriously computationally intensive and time-consuming. Quantization technology is promising to improve training efficiency by using lower data bitwidths to reduce storage and computing requirements. Currently, state-of-the-art quantization training algorithms have a negligible loss of accuracy, which requires dedicated quantization circuits for dynamic quantization of large amounts of data. In addition, the matrix transposition problem during neural network training gradually becomes a challenge as the network size increases. To address this problem, we propose a quantized training architecture which is a heterogeneous architecture consisting of a computing-in-memory (CIM) macro and a systolic array. First, the CIM macro realizes efficient transpose matrix multiplication through flexible data path control, which handles the need for transpose operation of the weight matrix in neural network training. Second, the systolic array utilizes two different data flows in the forward (FW) and backward (BW) propagation for the transpose matrix multiplication of the activation matrix in neural network training and provides higher computational throughput. Then, we design efficient dedicated quantization circuits for quantization algorithms to support efficient quantization training. Experimental results show that the area and power consumption of the two specialized quantization circuits are reduced by a factor of 1.35 and 5.4, on average, compared to floating-point computing circuits. The architecture achieves 4.05 tera operations per second per wat (TOPS/W) energy efficiency @ INT8 convolutional neural network (CNN) training at the 28-nm process. Compared to a state of the art (SOTA) quantization training architecture, SysCIM shows <inline-formula> <tex-math>$1.8times $ </tex-math></inline-formula> energy efficiency.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"990-1003"},"PeriodicalIF":2.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143676127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Low-Cost NTT Architecture With Conflict-Free Memory Access Scheme","authors":"Zhenyang Wu;Ruichen Kan;Jianbo Guo;Hao Xiao","doi":"10.1109/TVLSI.2025.3526261","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3526261","url":null,"abstract":"This brief proposes a scalable multistage and multipath architecture for variable number-theoretic transform (NTT). The proposed architecture adopts multiple parallel paths, each of which uses cascaded radix-2 butterfly units (BFUs). The radix-2 scheme simplifies the control logic and the cascaded BFU structure reduces the amount of RAM banks and the frequency of memory accesses. Moreover, a conflict-free and hardware-friendly in-place memory mapping scheme is proposed to ease the adaption to multiple paths, letting it be scalable for various throughputs. Compared with state-of-the-art works, the proposed architecture uses fewer resources and has better area-time product performance without penalty in throughput.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1407-1411"},"PeriodicalIF":2.8,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An On-Chip-Training Keyword-Spotting Chip Using Interleaved Pipeline and Computation-in-Memory Cluster in 28-nm CMOS","authors":"Junyi Qian;Cai Li;Long Chen;Ruidong Li;Tuo Li;Peng Cao;Xin Si;Weiwei Shan","doi":"10.1109/TVLSI.2025.3525740","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3525740","url":null,"abstract":"To improve the precision of keyword spotting (KWS) for individual users on edge devices, we propose an on-chip-training KWS (OCT-KWS) chip for private data protection while also achieving ultralow -power inference. Our main contributions are: 1) identity interchange and interleaved pipeline methods during backpropagation (BP), enabling the pipelined execution of operations that traditionally had to be performed sequentially, reducing cache requirements for loss values by 95.8%; 2) all-digital isolated-bitline (BL)-based computation-in-memory (CIM) macro, eliminating ineffective computations caused by glitches, achieving <inline-formula> <tex-math>$2.03times $ </tex-math></inline-formula> higher energy efficiency; and 3) multisize CIM cluster-based BP data flow, designing each CIM macro collaboratively to achieve all-time full utilization, reducing 47.2% of output feature map (Ofmap) access. Fabricated in 28-nm CMOS and enhanced with a refined library characterization methodology, this chip achieves both the highest training energy efficiency of 101.5 TOPS/W and the lowest inference energy of 9.9nJ/decision among current KWS chips. By retraining a three-class depthwise-separable convolutional neural network (DSCNN), detection accuracy on the private dataset increases from 80.8% to 98.9%.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1497-1501"},"PeriodicalIF":2.8,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault Bounding On-Die BCH Codes for Improving Reliability of System ECC","authors":"Seongyoon Kang;Chaehyeon Shin;Jongsun Park","doi":"10.1109/TVLSI.2024.3523899","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3523899","url":null,"abstract":"While continuous dynamic random access memory (DRAM) scaling may require an on-die error correction code (ECC) with enhanced correction capability, a double error correcting code with fault bounding scheme has not been explored. In this brief, we present the fault bounding on-die Bose-Chaudhuri–Hocquenghem (BCH) code that improves the compatibility with one-symbol error correcting system ECC used in dual data rate five (DDR5) dual in-line memory module (DIMM). By modifying the H matrix of BCH code, the proposed decoding method determines the fault boundary within which burst errors occur, effectively preventing the spread of these errors across fault boundaries. A comparison of bounded rates with conventional codes illustrates the enhanced compatibility with system ECC. The encoder and decoder of the proposed code have been implemented using a 28-nm CMOS process to demonstrate the hardware cost.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1482-1486"},"PeriodicalIF":2.8,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 65-nm 55.8-TOPS/W Compact 2T eDRAM-Based Compute-in-Memory Macro With Linear Calibration","authors":"Xueyong Zhang;Yong-Jun Jo;Tony Tae-Hyoung Kim","doi":"10.1109/TVLSI.2024.3520588","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3520588","url":null,"abstract":"Implementing parallel computing inside memory units, compute-in-memory (CIM) has shown significant energy and latency reduction, which are suitable for neural network accelerators, especially for low-power edge devices. This brief presents a compact 2T-eDRAM CIM structure to support signed 4b/4b/6b input/weight/output precision multiply-accumulate (MAC) operation, exploring a near-zero-skipping (NZS) technique to improve energy efficiency further and reduce weight update time. The center weight first (CWF) update method is proposed to extend the overall weight retention time. Furthermore, the analog multiplication and accumulation nonlinear compensation techniques are employed to improve the accuracy and linear range. Fabricated in 65-nm CMOS technology, this chip achieves the weight bit storage density of 3.7 Mb/mm2 and SWaP figure of merit of 210 TOPS/W Mb/mm2. The measured energy efficiency shows an average of 55.8 TOPS/W with the 4b/4b/6b input/weight/output precision at 1.2 V and 100 MHz.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1477-1481"},"PeriodicalIF":2.8,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Quad-Core VCO Incorporating Area-Saving Folded S-Shaped Tail Filtering in 28-nm CMOS","authors":"Shan Lu;Danyu Wu;Xuan Guo;Hanbo Jia;Yong Chen;Xinyu Liu","doi":"10.1109/TVLSI.2024.3498940","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3498940","url":null,"abstract":"This brief reports on a 13-GHz quad-core voltage-controlled oscillator (VCO) using a folded S-shaped tail inductor. The contribution of this work is that the auxiliary resonator is folded into the main inductor, so that it leads to a more compact solution than a conventional scheme. Due to the S-shaped inductor’s electromagnetic (EM) characteristics, the proposed tail filter can achieve noise suppression without EM interference to the main tank. Designed and implemented in a 28-nm CMOS process, the proposed VCO operates between 12.32 and 13.84 GHz, for an 11.6% turning range. The measurements were carried out in the free-running mode, and the results show a phase noise (PN) of 118.3 dBc/Hz at a 1-MHz offset from the central frequency of 12.32 GHz. The power consumption of the VCO core is 24.5 mW, with a 0.9-V supply voltage, and this leads to a figure of merit (FoM) of 186.6 dBc/Hz.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"1162-1166"},"PeriodicalIF":2.8,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143676078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information","authors":"","doi":"10.1109/TVLSI.2024.3517117","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3517117","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"C3-C3"},"PeriodicalIF":2.8,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10818619","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}