{"title":"A Dual-Mode Continuous–Time Sigma-Delta Modulator With a Reconfigurable Loop Filter Based on a Single Op-Amp Resonator","authors":"Young-Kyun Cho","doi":"10.1109/TVLSI.2024.3414298","DOIUrl":"10.1109/TVLSI.2024.3414298","url":null,"abstract":"This brief proposes a dual-mode continuous-time (CT) sigma-delta modulator (SDM) for switched-mode power supplies comprising a switchable loop filter (LF) based on a single op-amp resonator (SOR). The proposed modulator adaptively adjusts the LF architecture between the third and second order and optimizes the noise transfer function (NTF) using the partial resistors as per the sampling frequency. This facilitates the desired bandwidth and resolution while mitigating design complexity and minimizing the need for tuning circuitry. Moreover, the LF implemented with the SOR enhances both the power and area efficiency of the modulator in each operating mode by reducing the number of active components. The modulator was fabricated based on an 0.18-μm CMOS process with an active area of 0.105 mm². It achieved peak signal-to-noise ratios (SNRs) of 66.0/65.3 dB for signal bandwidths of 0.5/1.1 MHz. The power consumptions were 127/280 μW from a 1.8-V supply when clocked at 40/160 MHz. 
The figures of merit for each mode were 82/93 fJ/conv.-step.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 9","pages":"1754-1758"},"PeriodicalIF":2.8,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Precision Mixed-Computation Models for Inference on Edge","authors":"Seyedarmin Azizi;Mahdi Nazemi;Mehdi Kamal;Massoud Pedram","doi":"10.1109/TVLSI.2024.3409640","DOIUrl":"10.1109/TVLSI.2024.3409640","url":null,"abstract":"This article presents a mixed-computation neural network processing approach for edge applications that incorporates low-precision (low-width) Posit and low-precision fixed point (FixP) number systems. This mixed-computation approach uses 4-bit Posit (Posit4), which has higher precision around 0, for representing weights with high sensitivity, while it uses 4-bit FixP (FixP4) for representing other weights. A heuristic for analyzing the importance and the quantization error of the weights is presented to assign the proper number system to different weights. In addition, a gradient approximation for Posit representation is introduced to improve the quality of weight updates in the backpropagation process. Due to the high energy consumption of the fully Posit-based computations, neural network operations are carried out in FixP or Posit/FixP. An efficient hardware implementation of an MAC operation with a first Posit operand and FixP for a second operand and accumulator is presented. The efficacy of the proposed low-precision mixed-computation approach is extensively assessed on vision and language models. 
The results show that on average, the accuracy of the mixed-computation is about 1.5% higher than that of FixP with a cost of 0.19% energy overhead.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 8","pages":"1414-1422"},"PeriodicalIF":2.8,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141519155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing ConvNets With ConvFIFO: A Crossbar PIM Architecture Based on Kernel-Stationary First-In-First-Out Dataflow","authors":"Yu Qian;Liang Zhao;Fanzi Meng;Xiapeng Xu;Cheng Zhuo;Xunzhao Yin","doi":"10.1109/TVLSI.2024.3409648","DOIUrl":"10.1109/TVLSI.2024.3409648","url":null,"abstract":"Convolutional neural networks (ConvNets) have long been the model of choice for computer vision (CV) problems and gained renewed traction lately. In order to compute ConvNets more efficiently, process-in-memory (PIM) architectures based on emerging non-volatile memories (NVMs) such as RRAM have been widely studied. However, conventional NVM-based PIM suffered from various non-idealities including IR drop, sneak-path currents, large analog-to-digital converter (ADC) overhead, device variations, circuits mismatch, and error propagation. In this work, we propose ConvFIFO, a crossbar-memory-based PIM architecture for ConvNets featuring a kernel-stationary dataflow. Through the design of FIFO-type input and output buffers, smaller row-activation parallelism, and more compact ADCs, ConvFIFO can maximize the reuse rates of inputs and partial sums to achieve a more balanced trade-off among throughput, accuracy, and area/energy consumption. Using SRAM-based FIFO as the input/output buffer, ConvFIFO achieves a systolic architecture without the need to move weight data, bypassing the limitation of NVM endurance and minimizing the movement of partial sums. Moreover, the FIFO nature of the dataflow allows flexible pipeline design and load balancing. 
Compared to classical NVM-based PIM architectures such as ISAAC, ConvFIFO exhibits significant performance enhancement for various ConvNet models, showing 1.66–1.69×/1.69–1.74×/4.23–4.79×/1.59–1.74× improvement in terms of energy consumption, latency, Ops/W, and Ops/s×mm², respectively. Compared to GPUs, ConvFIFO exhibits only an average accuracy loss of 1.82% during inference.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 9","pages":"1640-1651"},"PeriodicalIF":2.8,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141939173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Design Framework for Generating Energy-Efficient Accelerator on FPGA Toward Low-Level Vision","authors":"Zikang Zhou;Xuyang Duan;Jun Han","doi":"10.1109/TVLSI.2024.3409649","DOIUrl":"10.1109/TVLSI.2024.3409649","url":null,"abstract":"Low-level vision algorithms play an increasingly crucial role in a wide range of applications, such as biomedical, security, and autopilot. The low-level vision accelerators have also been extensively researched. As low-level vision is often deployed in embedded devices, its accelerators need to achieve high energy efficiency. Meanwhile, the broad application scenarios of low-level vision contribute to its rapid iteration. Designing energy-efficient accelerators for quickly evolving low-level vision algorithms demands substantial effort. Therefore, a design framework specifically tailored for the generation of low-level vision accelerators is urgently needed. In this article, we propose an end-to-end algorithm-hardware generation framework, EffiVision, on field-programmable gate array (FPGA), aimed at generating highly energy-efficient dedicated accelerators for low-level vision neural networks. EffiVision proposes a hardware template that features multiple parallelisms and large architecture exploration spaces specifically designed to accommodate the characteristics of low-level vision networks. Then, it employs activation-weight aware mixed-precision quantization and FPGA-aware NNLUTs to search the suitable hardware parameters within the hardware template, generating highly energy-efficient accelerators tailored for low-level vision networks. We used EffiVision to perform hardware generation for three low-level vision neural networks fast super-resolution convolutional neural network (FSRCNN), denoising convolutional neural network (DnCNN), and demosaicing convolutional neural network (DMCNN) on Xilinx FPGA development boards, achieving the best energy efficiencies of 174.9, 97.8, and 92.7 GOPS/W, respectively. 
The generated accelerators of FSRCNN and DnCNN are 1.11× and 3.37× more efficient than previous works.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 8","pages":"1485-1497"},"PeriodicalIF":2.8,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ALT-Lock: Logic and Timing Ambiguity-Based IP Obfuscation Against Reverse Engineering","authors":"Jonti Talukdar;Woo-Hyun Paik;Eduardo Ortega;Krishnendu Chakrabarty","doi":"10.1109/TVLSI.2024.3411033","DOIUrl":"10.1109/TVLSI.2024.3411033","url":null,"abstract":"We present a logic ambiguity-based intellectual property (IP) obfuscation method that replaces traditional key gates with key-controlled functionally ambiguous logic gates, called LGA gates. We also protect timing paths by developing timing-ambiguous sequential cells called TA cells. We call this locking scheme ambiguous logic and timing logic locking (referred to as ALT-Lock). ALT-Lock ensures a two-pronged system-level security scheme where the attacker is forced to unlock not only combinational logic obfuscation but also timing obfuscation. We show that a combination of logic and timing ambiguity (TA) provides security against oracle-guided attacks. This method is superior to other traditional IP protection schemes such as combinational or sequential locking as it guarantees security against both oracle-guided and oracle-free attacks, while ensuring low power, performance, and area (PPA) overhead.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 8","pages":"1535-1548"},"PeriodicalIF":2.8,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Endurance-Aware Compiler for 3-D Stackable FeRAM as Global Buffer in TPU-Like Architecture","authors":"Yuan-Chun Luo;Anni Lu;Yandong Luo;Sou-Chi Chang;Uygar Avci;Shimeng Yu","doi":"10.1109/TVLSI.2024.3412631","DOIUrl":"10.1109/TVLSI.2024.3412631","url":null,"abstract":"Emerging nonvolatile memories as embedded memories offer low leakage power and high memory density, compared to the static random access memory (SRAM) and embedded dynamic random access memory (eDRAM) at the same technology node. However, the emerging memories generally suffer from limited cycling endurance. For read/write intensive applications, the limited endurance could become a bottleneck that limits the lifetime of the overall system. In this work, Intel’s reported prototype 3-D stackable ferroelectric random access memory (FeRAM) is considered as the global buffer memory of a tensor-processing-unit (TPU)-like architecture. An endurance-aware compiler is proposed to evaluate the maximum number of deep neural network (DNN) trainings considering the experimentally measured endurance limit. In addition, the proposed compiler applies two strategies to alleviate the endurance issue. The first strategy is wear leveling, and the second strategy is the dual-mode operation between volatile and nonvolatile modes. The maximum numbers of trainings increase by 6× to 300× and 4× to 58× thanks to the wear-leveling and dual-mode operations, respectively. 
Finally, a guideline of the system endurance (maximum number of trainings) is provided with given memory device endurance to bridge the gap between memory device engineers and system designers.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 9","pages":"1696-1703"},"PeriodicalIF":2.8,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gain and Power Enhancement With Coupled Technique for a Distributed Power Amplifier in 0.25-μm GaN HEMT Technology","authors":"Xu Yan;Jingyuan Zhang;Guansheng Lv;Wenhua Chen;Yongxin Guo","doi":"10.1109/TVLSI.2024.3411143","DOIUrl":"10.1109/TVLSI.2024.3411143","url":null,"abstract":"In this article, a fully integrated 1.0–11.0-GHz wideband distributed power amplifier (DPA) monolithic microwave integrated circuit (MMIC) design is presented. Particularly, a coupled technique with bandpass (CTB) characteristic between the kth output node and the (k+1)th input node of amplification units (AUs) is adopted in the DPA design. It generates an additional signal reuse path (SRP) to reuse part of the output signal to superimpose the input signal, and then they will be reamplified to the output artificial transmission line (O-ATML). Moreover, due to the bandpass characteristic, the signal reuse can be manipulated to target the upper cutting edges of the working band to alleviate sharp gain and power roll-off. By carefully controlling the SRP, the overall gain, output power, and bandwidth are enhanced and extended. The systematic design approach for the DPA is detailed with circuit implementations and optimizations. To validate the proposed concept, a DPA MMIC prototype is implemented and fabricated in a commercial 0.25-μm gallium nitride (GaN)-on-silicon carbide (SiC) high-electron-mobility transistor (HEMT) process. It shows the compact layout within a die size of 3.36 mm². Under 28-V VDD power supply, the measured results show a flat 14.8±1.0-dB small-signal gain with 10.0-GHz wide operating bandwidth and good impedance matching conditions. 
A saturated output power (P_sat) of 7.25 W with peak power-added efficiency (PAE) exceeding 38.7% is achieved. The proposed DPA obtains around 1.54–2.16-W/mm² power density associated with an average PAE of 34.5% over the entire frequency range.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 8","pages":"1523-1534"},"PeriodicalIF":2.8,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hardware and Software Co-Design for Energy-Efficient Neural Network Accelerator With Multiplication-Less Folded-Accumulative PE for Radar-Based Hand Gesture Recognition","authors":"Fan Li;Yunqi Guan;Wenbin Ye","doi":"10.1109/TVLSI.2024.3409674","DOIUrl":"10.1109/TVLSI.2024.3409674","url":null,"abstract":"This work presents a novel lightweight neural network (NN) model and a dedicated NN accelerator for radar-based hand gesture recognition (HGR). The NN model employs symmetric weights, group 1-D-convolution, and power-of-two (POT) quantization, achieving 92.84% accuracy on a public dataset with only 4.8 k parameters, while reducing parameter storage by 40%. The custom accelerator features a multiplication-less folded-accumulative processing element (PE), group-wise computation optimization, and an efficient scheduling mechanism for fully connected (FC) layers. Implemented on a Xilinx field-programmable gate array (FPGA) board XC7S15 and 65-nm CMOS technology, it surpasses existing solutions in power efficiency and cost-effectiveness, addressing the computational demands for IoT deployment.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 10","pages":"1964-1968"},"PeriodicalIF":2.8,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141939053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iEDCL: Streamlined, False-Error-Free Error Detection and Correction Scheme in a Near-Threshold Enabled 32-bit Processor","authors":"Runze Yu;Zhenhao Li;Xi Deng;Zhaoxu Wang;Wei Jia;Haoming Zhang;Zhenglin Liu","doi":"10.1109/TVLSI.2024.3409315","DOIUrl":"10.1109/TVLSI.2024.3409315","url":null,"abstract":"This article presents internal error detection, correction, and latching (iEDCL), a designer-friendly, fully functional error detection and correction (EDAC) approach tailored for energy-efficient near-threshold systems capable of tolerating variations. It embeds error detection (ED), correction, and latching circuits within a flip-flop (FF) with an additional 15 transistors to monitor critical paths. Notably, iEDCL’s error-aware capability remains stable despite clock latency and parasitic effects, relieving designers of extensive involvement and eliminating false errors. iEDCL is automatedly implemented in an ARM Cortex-M0 processor at 55 nm without extra architecture modifications, incurring only a 6.78% area overhead. An adaptive voltage scaling (AVS) loop enables automatic operation, achieving high energy efficiency beyond the point of the first failure while maintaining a predefined error rate. Measurement results obtained from different dies at various temperatures demonstrate significant energy savings achieved by the iEDCL processor, with up to 16.9% and 49.1% reductions compared to critical baseline and signoff designs, respectively, while maintaining a 5% error rate at a 16 MHz frequency. 
To the best of our knowledge, this article presents one of the first FF EDAC implementations fully operational without potential false errors at near-threshold voltages while enhancing energy efficiency.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 8","pages":"1436-1446"},"PeriodicalIF":2.8,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced-Linearity Wideband Full-Duplex Receiver With Shared Self-Interference Canceller","authors":"Fan Chen;Wei Li;Chuangguo Wang;Yunyou Pu;Xingyu Ma;Shijiao Dong;Yun Wang;Hongtao Xu","doi":"10.1109/TVLSI.2024.3410010","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3410010","url":null,"abstract":"A wideband full-duplex (FD) receiver with enhanced-linearity technique and shared self-interference cancellation (SIC) is implemented in a 40-nm CMOS process. By combining Hilbert-transform-equalization (HTE)-based self-interference (SI) canceller and translational loop, an FD receiver with RF domain cancellation is presented with an extra auxiliary cancellation path by reusing the mixer in the translational loop. By introducing the auxiliary path, the influence of SI circuit to receiver front end is minimized. Meanwhile, a self-loaded linearization technique with acceptable noise degradation and extra power consumption is proposed to be employed in the FD receiver for both receiver and SI canceller. Due to the 2-D regulation, such a technique can achieve a relatively robust linearity improvement and bring flexibility to circuit design. The measurement results show that the proposed FD receiver operates across 0.8–3.5 GHz with a gain of 29.0–31.8 dB and a noise figure of 3.68–5.23 dB. The proposed linearization technique achieves 3.2–4.7-dB linearity improvement for receiver with only 0.45–0.64-dB NF degradation. In addition, the canceller with the proposed linearization method achieves RF domain delays ranging from 1.59 to 4.03 ns while demonstrating more than 6.33-dB linearity improvement. 
With the implementation of self-loaded technique and shared SIC, a greater than 23.4-dB RF domain SI suppression is measured across 40-MHz bandwidth (BW) with 64-QAM modulated signals in a circulator-based setup for the SIC scheme in this work with RX noise degradation of less than 1.38 dB.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 9","pages":"1578-1589"},"PeriodicalIF":2.8,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}