{"title":"A Picowatt CMOS Voltage Reference Using Independent TC and Output Level Calibrations","authors":"Yuyang Li;Ryan Caginalp;Inhee Lee","doi":"10.1109/TVLSI.2024.3508259","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3508259","url":null,"abstract":"We propose a low-power voltage reference that enables independent adjustment of temperature sensitivity and output level. This design enhances the temperature sensitivity without impacting the output level distribution, in contrast to previous methods. The proposed circuit achieves this by integrating a separate control system that utilizes diode-connected pMOS transistors and an analog multiplexer for output level adjustment, along with biasing current control to improve the temperature sensitivity. In a 180-nm CMOS process, the prototype circuit generates a stable reference voltage averaging 192 mV, maintaining an accuracy of ±8.8 mV ($\\pm 3\\sigma$) from 0 °C to 75 °C across ten samples. In addition, it consumes only 35.8 pW at 0.6 V and 25 °C.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1244-1254"},"PeriodicalIF":2.8,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MCM-SR: Multiple Constant Multiplication-Based CNN Streaming Hardware Architecture for Super-Resolution","authors":"Seung-Hwan Bae;Hyuk-Jae Lee;Hyun Kim","doi":"10.1109/TVLSI.2024.3504513","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3504513","url":null,"abstract":"Convolutional neural network (CNN)-based super-resolution (SR) methods have become prevalent in display devices due to their superior image quality. However, the significant computational demands of CNN-based SR require hardware accelerators for real-time processing. Among the hardware architectures, the streaming architecture can significantly reduce latency and power consumption by minimizing external dynamic random access memory (DRAM) access. Nevertheless, this architecture necessitates a considerable hardware area, as each layer needs a dedicated processing engine. Furthermore, achieving high hardware utilization in this architecture requires substantial design expertise. In this article, we propose methods to reduce the hardware resources of CNN-based SR accelerators by applying the multiple constant multiplication (MCM) algorithm. We propose a loop interchange method for the convolution (CONV) operation to reduce the logic area by 23% and an adaptive loop interchange method for each layer that considers both the static random access memory (SRAM) and logic area simultaneously to reduce the SRAM size by 15%. In addition, we improve the MCM graph exploration speed by $5.4\\times$ while maintaining the SR quality through beam search when CONV weights are approximated to reduce the hardware resources.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"75-87"},"PeriodicalIF":2.8,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Harmonic-Suppressed GaN Power Amplifier Using Artificial Coupled Resonator","authors":"Letian Guo;Jincheng Zhang;Lihe Nie;Jian Wang;Yong Chen;Junyan Ren;Shunli Ma","doi":"10.1109/TVLSI.2024.3487002","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3487002","url":null,"abstract":"This brief presents an 11.5–17.5-GHz power amplifier (PA) with 32-dBm output power in a 0.25-$\\mu$m gallium nitride (GaN) process. Capacitively and inductively coupled resonators are used for impedance matching to achieve a flat in-band power gain and a high out-of-band rejection. Meanwhile, the output matching network provides a second-harmonic suppression to improve the average efficiency within the bandwidth of the PA. The measurements show that the proposed PA exhibits an output power of 31–32.5 dBm and a power gain of more than 10.5 dB from 11.5 to 17.5 GHz. Due to the matching networks providing convenient dc feed and dc block, the chip dimension is only $2.1\\times 1.1$ mm$^{2}$, corresponding to a power density of 0.77 W/mm$^{2}$. The proposed PA demonstrates a competitive fractional bandwidth and power density in GaN PA monolithic microwave integrated circuits (MMICs).","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"882-886"},"PeriodicalIF":2.8,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143496462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Single-Ended High-Voltage-Compliant 11-bit Current-Steering Digital-to-Analog Converter for Adaptive Noise Cancellation in Power Over Data Line Networks","authors":"Felix Burkhardt;Florian Protze;Frank Ellinger","doi":"10.1109/TVLSI.2024.3496845","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496845","url":null,"abstract":"Automotive Ethernet is considered to be the backbone of future in-vehicle data communication. One main feature is its ability to simultaneously transmit data and energy via power over data lines (PoDL). This article proposes the design of a single-ended high-voltage (HV)-compliant 11-bit current-steering digital-to-analog converter (DAC). The converter is tailored for use as a digitally controlled current source in an adaptive noise-cancellation filter for PoDL networks. Designed in an HV-compliant 180-nm bipolar complementary metal-oxide-semiconductor (BiCMOS) semiconductor technology, the DAC features a monolithically combined topology of two identical 10-bit low-voltage (LV) current-steering DACs supplied at 1.8 V and two complementary HV-compliant output current stages. The main design features of the segmented LV DAC are the utilization of single-ended current cells with an optimized switching logic, proposed to enhance the cells' transient performance and energy efficiency. Furthermore, a newly derived $Q^{4}$ asymmetric rotated walk switching scheme is investigated. At a maximum output voltage of 60 V, the proposed DAC can deliver a bidirectional output current with amplitudes of up to 500 mA. The proposed DAC exhibits the highest voltage compliance combined with the highest output current compared with related works. It also features the second highest resolution. Operated at a sample rate of 10 MS/s with a resolution of 11 bit, a spurious-free dynamic range (SFDR) of 57.8 dB could be measured for a synthesized single tone at 100 kHz, as well as a maximum integral nonlinearity (INL) error of 1.61 LSB and a differential nonlinearity (DNL) error of 1.05 LSB.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"638-650"},"PeriodicalIF":2.8,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPOT: Fast and Optimal Built-In Redundancy Analysis Using Smart Potential Case Collection","authors":"Donghyun Han;Sunghoon Kim;Dayoung Kim;Sungho Kang","doi":"10.1109/TVLSI.2024.3499955","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3499955","url":null,"abstract":"With advancements in manufacturing and design technology, memory integration density has improved. However, as integration density increases, the cost of testing and repairing memory has also risen, posing a significant challenge in memory production. To address this challenge, built-in self-repair (BISR) has been proposed. Traditional built-in redundancy analysis (BIRA) performs limited analysis of faults during the fault collection process, resulting in a significant delay in generating a repair solution after the test sequence is completed. This inefficiency arises from the time required to repair the memory posttest. This article proposes a new fast and optimal BIRA using smart potential case collection. The proposed BIRA conducts a detailed analysis of detected faults during the test process. Using these novel fault collection results, a potential case is generated. This is a repair case that can repair the memory with a high probability and is generated immediately after the test sequence ends. If the memory cannot be repaired by the potential case, an exhaustive search is conducted for the faults requiring further analysis to generate an optimal repair solution. Compared to previous studies, the proposed BIRA demonstrates extremely low analysis time with an optimal repair rate.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"780-792"},"PeriodicalIF":2.8,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Parallel Feed-Forward Current Ripple Rejection (PFFCRR) Technique for High Load Current High PSRR nMOS LDOs","authors":"Yuhong Lu;Ting-An Yen;Rakshit Dambe Nayak;Shashank Alevoor;Bhushan Talele;Spoorti Patil;Keith Kunz;Bertan Bakkaloglu","doi":"10.1109/TVLSI.2024.3497803","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3497803","url":null,"abstract":"There is a significant demand in systems-on-chip (SoCs) for a high-power efficiency low-dropout regulator (LDO) that provides lower dropout voltage, higher load current, and low quiescent current. A high-power supply rejection ratio (PSRR) at the mid-to-high frequency band (0.1–10 MHz) is crucial for an LDO to generate low-noise power supplies when driven by switching power converters. However, this presents a significant challenge to enhancing the PSRR since the pass field-effect transistor (FET) operates in the deep triode region at high-current and dropout conditions. In this article, a parallel feed-forward current ripple rejection (PFFCRR) technique is proposed to improve the PSRR performance regardless of the operation region of the nMOS pass FET. The proposed approach senses the supply-induced current ripple and cancels the original ripple through a current path that runs parallel to the nMOS pass FET. The proposed LDO is fabricated in a 180-nm BCD process. The proposed LDO achieves a PSRR better than −35 dB up to 10 MHz at 300-mV dropout voltage with 0.5-A load current and a load capacitor of $2.2~\\mu$F. The PFFCRR approach achieves a PSRR improvement of 18 dB at 1 MHz at 100-mV dropout voltage with a 2.15-A load current when the pass FET operates in the deep triode region. Moreover, the proposed LDO enhances the transient performance with an overshoot and an undershoot of 40.54 and 36.45 mV, respectively, against $\\Delta {I}_{\\text{LOAD}}$ of 1 A with a slew rate of 1 A/$\\mu$s.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"651-661"},"PeriodicalIF":2.8,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 0.09-pJ/Bit Logic-Compatible Multiple-Time Programmable (MTP) Memory-Based PUF Design for IoT Applications","authors":"Shuming Guo;Yinyin Lin;Hao Wang;Yao Li;Chongyan Gu;Weiqiang Liu;Yijun Cui","doi":"10.1109/TVLSI.2024.3496735","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496735","url":null,"abstract":"The Internet of Things (IoT) allows devices to interact for real-time data transfer and remote control. However, IoT hardware devices have been shown to have security vulnerabilities. Edge device authentications, as a crucial process for IoT systems, generate and use unique IDs for secure data transmissions. Conventional authentication techniques, computational and heavyweight, are challenging and infeasible in IoT due to the limited resources of IoT devices. Physical unclonable functions (PUFs), a lightweight hardware-based security primitive, were proposed for resource-constrained applications. We propose a new PUF design for resource-constrained IoT devices based on low-cost logic-compatible multiple-time programmable (MTP) memory cells. The structure includes an array of MTP differential memory cells and a PUF extraction circuit. The extraction method uses the random distribution of BL current after programming each memory cell in logic-compatible MTP memory as the entropy source of PUF. Responses are obtained by comparing the current values of two memory cells under a certain address by challenge, forming challenge–response pairs (CRPs). This scheme does not increase hardware consumption and circuit differences on edge devices and is an intrinsic PUF. Finally, 200 PUF chips were fabricated by CSMC based on the 0.153-$\\mu$m MCU single-gate CMOS process. The performance of the logic-compatible MTP memory cell and its PUF was evaluated. A logic-compatible MTP cell has good programming erase efficiency and good durability and retention. The uniqueness of the proposed PUF is 50.29%, the uniformity is 51.82%, and the reliability is 93.61%.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"248-260"},"PeriodicalIF":2.8,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An RISC-V PPA-Fusion Cooperative Optimization Framework Based on Hybrid Strategies","authors":"Tianning Gao;Yifan Wang;Ming Zhu;Xiulong Wu;Dian Zhou;Zhaori Bi","doi":"10.1109/TVLSI.2024.3496858","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496858","url":null,"abstract":"The optimization of RISC-V designs, encompassing both microarchitecture and CAD tool parameters, is a great challenge due to an extensive and high-dimensional search space. Conventional optimization methods, such as case-specific approaches and black-box optimization approaches, often fall short of addressing the diverse and complex nature of RISC-V designs. To achieve optimal results across various RISC-V designs, we propose the cooperative optimization framework (COF) that integrates multiple black-box optimizers, each specializing in different optimization problems. The COF introduces the landscape knowledge exchange mechanism (LKEM) to direct the optimizers to share their knowledge of the optimization problem. Moreover, the COF employs the dynamic computational resource allocation (DCRA) strategies to dynamically allocate computational resources to the optimizers. The DCRA strategies are guided by the optimizer efficiency evaluation (OEE) mechanism and a time series forecasting (TSF) model. The OEE provides real-time performance evaluations. The TSF model forecasts the optimization progress made by the optimizers, given the allocated computational resources. In our experiments, the COF reduced the cycles per instruction (CPI) of the Berkeley out-of-order machine (BOOM) by 15.36% and the power of Rocket-Chip by 12.84% without constraint violation compared to the respective initial designs.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"140-153"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ArXrCiM: Architectural Exploration of Application-Specific Resonant SRAM Compute-in-Memory","authors":"Dhandeep Challagundla;Ignatius Bezzam;Riadul Islam","doi":"10.1109/TVLSI.2024.3502359","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3502359","url":null,"abstract":"While general-purpose computing follows von Neumann’s architecture, the data movement between memory and processor elements dictates the processor’s performance. The evolving compute-in-memory (CiM) paradigm tackles this issue by facilitating simultaneous processing and storage within static random-access memory (SRAM) elements. Numerous design decisions taken at different levels of hierarchy affect the figures of merit (FoMs) of SRAM, such as power, performance, area, and yield. The absence of a rapid assessment mechanism for the impact of changes at different hierarchy levels on global FoMs poses a challenge to accurately evaluating innovative SRAM designs. This article presents an automation tool designed to optimize the energy and latency of SRAM designs incorporating diverse implementation strategies for executing logic operations within the SRAM. The tool structure allows easy comparison across different array topologies and various design strategies to result in energy-efficient implementations. Our study involves a comprehensive comparison of more than 6900 distinct design implementation strategies for École Polytechnique Fédérale de Lausanne (EPFL) combinational benchmark circuits on the energy-recycling resonant CiM (rCiM) architecture designed using Taiwan Semiconductor Manufacturing Company (TSMC) 28-nm technology. When provided with a combinational circuit, the tool aims to generate an energy-efficient implementation strategy tailored to the specified input memory and latency constraints. The tool reduces 80.9% of energy consumption on average across all benchmarks while using the six-topology implementation compared with the baseline implementation of single-macro topology by considering the parallel processing capability of rCiM cache size ranging from 4 to 192 kB.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"179-192"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MASL-AFU: A High Memory Access Efficiency 2-D Scalable LUT-Based Activation Function Unit for On-Device DNN Training","authors":"Zhaoteng Meng;Lin Shu;Jianing Zeng;Zhan Li;Kailin Lv;Haoyue Yang;Jie Hao","doi":"10.1109/TVLSI.2024.3488782","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3488782","url":null,"abstract":"On-device deep neural network (DNN) training faces constraints in storage capacity and energy supply. Existing works primarily focus on optimizing the training of convolutional and batch normalization (BN) layers to improve the compute-to-communication (CTC) ratio and reduce the energy cost of off-chip memory access (MA). However, the training of activation layers remains challenging due to the additional off-chip MA required for derivative calculations. This article proposes MASL-AFU, an architecture designed to accelerate the activation layer in on-device DNN training. MASL-AFU leverages nonuniform piecewise linear (NUPWL) functions to speed up the forward propagation (FP) in the activation layer. During the error propagation (EP) process, retrieving derivatives from a lookup table (LUT) eliminates the need for redundant retrieval of the input data used in FP. By storing LUT indices instead of the original activation inputs, MASL-AFU significantly reduces and accelerates MA. Compared to other activation function units, MASL-AFU offers up to a $5.8\\times$ increase in computational and off-chip MA efficiency. In addition, MASL-AFU incorporates two dimensions of scalability: data precision and the number of LUT entries. These scalable, hardware-friendly methods enhance MASL-AFU’s area efficiency by up to $3.24\\times$ and energy efficiency by up to $3.85\\times$.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"707-719"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}