{"title":"A 28-Gb/s Single-Ended PAM-4 Transceiver With Active-Inductor Equalizer and Amplitude-Detection LSB Decoder for Memory Interfaces","authors":"Hwaseok Shin;Hyoshin Kang;Yoonjae Choi;Jincheol Sim;Jonghyuck Choi;Youngwook Kwon;Seungwoo Park;Seongcheol Kim;Changmin Sim;Junseob So;Taehwan Kim;Chulwoo Kim","doi":"10.1109/TVLSI.2024.3496878","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496878","url":null,"abstract":"This study proposes a power-efficient 28-Gb/s single-ended four-level pulse amplitude modulation (PAM-4) transceiver (TRX) for next-generation memory interfaces. In the transmitter (TX), an active-inductor equalizer (EQAI) is utilized, while in the receiver (RX), an amplitude-detection least significant bit (LSB) decoder is employed. In the TX, conventional equalization techniques consume substantial power owing to the inclusion of additional components and strong driving power required to mitigate channel-induced intersymbol interference (ISI). However, the proposed EQAI achieves a bandwidth extension up to the Nyquist frequency through gain boosting while reducing hardware costs and minimizing the driving strength. This results in a simple structure with operational efficiency, facilitating low power consumption and a compact area compared with conventional TX equalizers. In PAM-4 RX, the power dissipation is proportional to the clock buffer and the number of comparators used for data decoding. To improve the hardware cost and the power usage in the RX, the proposed RX design utilizes an amplitude-detection LSB decoder, which reduces the number of comparators and comprises a one-stage structure by detecting the amplitude differences between the reference and input voltages during LSB decoding. This ensures the hardware cost and power consumption improvement while implementing a one-tap direct decision feedback equalizer (DFE). 
The TRX for memory interfaces is optimized for low-power performance by employing these methods, resulting in a notable energy efficiency of 0.96 pJ/bit. This structure is fabricated using a 28-nm CMOS technology, and the core area of the TRX occupies 0.0053 mm².","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"662-672"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Securet3d: An Adaptive, Secure, and Fault-Tolerant Aware Routing Algorithm for Vertically–Partially Connected 3D-NoC","authors":"Alexandre Almeida da Silva;Lucas Nogueira;Alexandre Coelho;Jarbas A. N. Silveira;César Marcon","doi":"10.1109/TVLSI.2024.3500575","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3500575","url":null,"abstract":"Multiprocessor systems-on-chip (MPSoCs) based on 3-D networks-on-chip (3D-NoCs) are crucial architectures for robust parallel computing, efficiently sharing resources across complex applications. To ensure the secure operation of these systems, it is essential to implement adaptive, fault-tolerant mechanisms capable of protecting sensitive data. This work proposes the Securet3d routing algorithm, which establishes secure data paths in fault-tolerant 3D-NoCs. Our approach enhances the Reflect3d algorithm by introducing a detailed scheme for mapping secure paths and improving the system’s ability to withstand faults. To validate its effectiveness, we compare Securet3d with three other fault-tolerant routing algorithms for vertically-partially connected 3D-NoCs. All algorithms were implemented in SystemVerilog and evaluated through simulation using ModelSim and hardware synthesis with Cadence’s Genus tool. Experimental results show that Securet3d reduces latency and enhances cost-effectiveness compared with other approaches. When implemented with a 28-nm technology library, Securet3d demonstrates minimal area and energy overhead, indicating scalability and efficiency. Under denial-of-service (DoS) attacks, Securet3d maintains basically unaltered average packet latencies on 70, 90, and 29 clock cycles for uniform random, bit-complement, and shuffle traffic, significantly lower than those of other algorithms without including security mechanisms (5763, 4632, and 3712 clock cycles in average, respectively). 
These results highlight the superior security, scalability, and adaptability of Securet3d for complex communication systems.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"275-287"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Retry-Based Synchronization for Online Testing of Identical Logic Blocks","authors":"Irith Pomeranz","doi":"10.1109/TVLSI.2024.3501402","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3501402","url":null,"abstract":"State-of-the-art designs include identical instances of logic blocks to support parallel computations. Identical logic blocks at close physical proximity can be tested online by comparing their output sequences. This removes the need for known input and output sequences. To use output comparison for two logic blocks, <inline-formula> <tex-math>$B_{0}$ </tex-math></inline-formula> and <inline-formula> <tex-math>$B_{1}$ </tex-math></inline-formula>, the logic blocks should be synchronized to the same state, and the same input sequence should be applied to them. Assuming that <inline-formula> <tex-math>$B_{0}$ </tex-math></inline-formula> performs functional computations and <inline-formula> <tex-math>$B_{1}$ </tex-math></inline-formula> is idle, a process described earlier synchronizes <inline-formula> <tex-math>$B_{1}$ </tex-math></inline-formula> to the state of <inline-formula> <tex-math>$B_{0}$ </tex-math></inline-formula> by using a synchronization period where <inline-formula> <tex-math>$B_{1}$ </tex-math></inline-formula> receives the input sequence of <inline-formula> <tex-math>$B_{0}$ </tex-math></inline-formula>, and values of selected state variables are copied from <inline-formula> <tex-math>$B_{0}$ </tex-math></inline-formula> to <inline-formula> <tex-math>$B_{1}$ </tex-math></inline-formula>. A single synchronization period was used earlier. The first key contribution of this article is to introduce a retry-based synchronization process with multiple synchronization periods to avoid flagging synchronization failures as faults. The second contribution of this article is to develop the synchronization process in a simulation environment that considers functional operation conditions. 
Experimental results for benchmark circuits demonstrate the effectiveness of the retry-based process and the importance of the functional simulation environment.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1447-1451"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Area-Efficient Pipeline Architecture for Serial Real-Valued Fast Fourier Transform","authors":"Kun Li;Hongji Fang;Zhenguo Ma;Feng Yu;Bo Zhang;Qianjian Xing","doi":"10.1109/TVLSI.2024.3496922","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496922","url":null,"abstract":"This brief presents a novel pipeline architecture designed to compute the fast Fourier transform (FFT) on real input signals in a serial format. This architecture significantly improves resource efficiency by sharing adders between butterfly and rotator structures. In addition, a novel data management approach for N-point radix-2 serial real-valued FFT (RFFT) has been proposed, which not only simplifies the data reordering circuit between processing elements (PEs) but also achieves natural order data output. The real-valued 1024-point FFT has been implemented on a field-programmable gate array (FPGA). Compared with typical real-valued serial commutator (RSC) FFT architecture, the proposed architecture achieves substantial improvement, including a reduction of 10.3% in the number of lookup tables (LUTs) and 12.5% in flip-flops (FFs).","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1427-1431"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 7.4–9.2-GHz Fractional-N Differential Sampling PLL Based on Phase-Domain and Voltage-Domain Hybrid Calibration","authors":"Feng Bu;Ruixue Ding;Depeng Sun;Ge Wang;Yuan Gao;Rong Zhou;Xiaoteng Zhao;Lisheng Chen;Shubin Liu;Zhangming Zhu","doi":"10.1109/TVLSI.2024.3496931","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496931","url":null,"abstract":"This brief proposes a 7.4–9.2-GHz low-noise fractional-N differential sampling phase-locked loop (DSPLL), which features doubled phase detector (PD) gain. By using the phase-domain and voltage-domain hybrid calibration, the accumulated quantization error (Q-error) of the delta-sigma modulator (DSM) is compensated, and the locking problem caused by large sampling voltage fluctuation is solved. Meanwhile, a voltage shifting technique is introduced to adjust the locked voltage region of differential sampling PD (DSPD), which can improve the linearity of DSPLL for better calibration. Fabricated in 65-nm CMOS process, the presented DSPLL achieves measured integrated jitter of 69.09 and 73.26 fs for integer-N and fractional-N modes, respectively. The reference spur is −72.96 dBc, and the worst fractional spur is −55.26 dBc. The total power consumption is 19.2 mW at a 1.2-V supply, achieving a figure of merit jitter (FOMJ) of −249.9 dB.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1442-1446"},"PeriodicalIF":2.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An On-Chip Low-Cost Averaging Digital Sampling Scope for 80-GS/s Measurement of Wireline Pulse Responses","authors":"Won Joon Choi;Myungguk Lee;Junung Choi;Jaeik Cho;Gain Kim;Byungsub Kim","doi":"10.1109/TVLSI.2024.3497213","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3497213","url":null,"abstract":"Determining a channel’s characteristics is a fundamental step for designing a high-speed link system. By identifying the properties of the channel, designers can gain insights into how to transmit a signal with low distortion and optimize a transceiver’s architecture. As the channel’s characteristics can be identified by analyzing its single-bit pulse response (PR), obtaining an accurate PR plot is critical for reliable channel characterization. Therefore, it is preferred to measure the PR in situ to minimize the parasitic effects. In this work, we introduce a novel approach for measuring PR in situ, designed to quickly and accurately generate undistorted plot results. To prove the efficacy of the proposed method, we designed an on-chip sampling scope circuit and fabricated a test chip in 28-nm CMOS technology. While being able to measure a distortion-free PR, the proposed method demonstrates a more than <inline-formula> <tex-math>$10^{5}$ </tex-math></inline-formula> times faster pulse acquisition rate than prior arts.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 5","pages":"1432-1436"},"PeriodicalIF":2.8,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV","authors":"Qian Chen;Xiaofeng Yang;Shengli Lu","doi":"10.1109/TVLSI.2024.3497166","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3497166","url":null,"abstract":"Sparse triangular solve (SpTRSV) is widely used in various domains. Numerous studies have been conducted using CPUs, GPUs, and specific hardware accelerators, where dataflows can be categorized into coarse and fine granularity. Coarse dataflows offer good spatial locality but suffer from low parallelism, while fine dataflows provide high parallelism but disrupt the spatial structure, leading to increased nodes and poor data reuse. This article proposes a novel hardware accelerator for SpTRSV or SpTRSV-like directed acyclic graphs (DAGs). The accelerator implements a medium granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. In addition, a partial sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm of intranode edges’ computation is developed to enhance data reuse. Experimental results on 245 benchmarks with node counts reaching up to 85392 demonstrate that this work achieves average performance improvements of <inline-formula> <tex-math>$7.0\\times $ </tex-math></inline-formula> (up to <inline-formula> <tex-math>$27.8\\times $ </tex-math></inline-formula>) over CPUs and <inline-formula> <tex-math>$5.8\\times $ </tex-math></inline-formula> (up to <inline-formula> <tex-math>$98.8\\times $ </tex-math></inline-formula>) over GPUs. 
Compared with the state-of-the-art technique (DPU-v2), this work shows a <inline-formula> <tex-math>$2.5\\times $ </tex-math></inline-formula> (up to <inline-formula> <tex-math>$5.9\\times $ </tex-math></inline-formula>) average performance improvement and <inline-formula> <tex-math>$1.7\\times $ </tex-math></inline-formula> (up to <inline-formula> <tex-math>$4.1\\times $ </tex-math></inline-formula>) average energy efficiency enhancement.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 3","pages":"807-820"},"PeriodicalIF":2.8,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Learning-Based Performance Testing for Analog Integrated Circuits","authors":"Jiawei Cao;Chongtao Guo;Houjun Wang;Zhigang Wang;Hao Li;Geoffrey Ye Li","doi":"10.1109/TVLSI.2024.3496777","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496777","url":null,"abstract":"In this brief, we propose a deep learning-based performance testing framework to minimize the number of required test modules while guaranteeing the accuracy requirement, where a test module corresponds to a combination of one circuit and one stimulus. First, we apply a deep neural network (DNN) to establish the mapping from the response of the circuit under test (CUT) in each module to all specifications to be tested. Then, the required test modules are selected by solving a 0–1 integer programming problem. Finally, the predictions from the selected test modules are combined by a DNN to form the specification estimations. The simulation results validate the proposed approach in terms of testing accuracy and cost.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"1187-1191"},"PeriodicalIF":2.8,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143667705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-Based Low-Bit and Lightweight Fast Light Field Depth Estimation","authors":"Jie Li;Chuanlun Zhang;Wenxuan Yang;Heng Li;Xiaoyan Wang;Chuanjun Zhao;Shuangli Du;Yiguang Liu","doi":"10.1109/TVLSI.2024.3496751","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496751","url":null,"abstract":"The 3-D vision computing is a key application in unmanned systems, satellites, and planetary rovers. Learning-based light field (LF) depth estimation is one of the major research directions in 3-D vision computing. However, conventional learning-based depth estimation methods involve a large number of parameters and floating-point operations, making it challenging to achieve low-power, fast, and high-precision LF depth estimation on a field-programmable gate array (FPGA). Motivated by this issue, an FPGA-based low-bit, lightweight LF depth estimation network (L<inline-formula> <tex-math>$^{3}\\text {FNet}$ </tex-math></inline-formula>) is proposed. First, a hardware-friendly network is designed, which has small weight parameters, low computational load, and a simple network architecture with minor accuracy loss. Second, we apply efficient hardware unit design and software-hardware collaborative dataflow architecture to construct an FPGA-based fast, low-bit acceleration engine. Experimental results show that compared with the state-of-the-art works with lower mean-square error (mse), L<inline-formula> <tex-math>$^{3}\\text {FNet}$ </tex-math></inline-formula> can reduce the computational load by more than 109 times and weight parameters by approximately 78 times. Moreover, on the ZCU104 platform, it requires 95.65% lookup tables (LUTs), 80.67% digital signal processors (DSPs), 80.93% BlockRAM (BRAM), 58.52% LUTRAM, and 9.493-W power consumption to achieve an efficient acceleration engine with a latency as low as 272 ns. 
The code and model of the proposed method are available at <uri>https://github.com/sansi-zhang/L3FNet</uri>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"88-101"},"PeriodicalIF":2.8,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 22-nm All-Digital Time-Domain Neural Network Accelerator for Precision In-Sensor Processing","authors":"Ahmed M. Mohey;Jelin Leslin;Gaurav Singh;Marko Kosunen;Jussi Ryynänen;Martin Andraud","doi":"10.1109/TVLSI.2024.3496090","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496090","url":null,"abstract":"Deep neural network (DNN) accelerators are increasingly integrated into sensing applications, such as wearables and sensor networks, to provide advanced in-sensor processing capabilities. Given wearables’ strict size and power requirements, minimizing the area and energy consumption of DNN accelerators is a critical concern. In that regard, computing DNN models in the time domain is a promising architecture, taking advantage of both technology scaling friendliness and efficiency. Yet, time-domain accelerators are typically not fully digital, limiting the full benefits of time-domain computation. In this work, we propose an all-digital time-domain accelerator with a small size and low energy consumption to target precision in-sensor processing like human activity recognition (HAR). The proposed accelerator features a simple and efficient architecture without dependencies on analog nonidealities such as leakage and charge errors. An eight-neuron layer (core computation layer) is implemented in 22-nm FD-SOI technology. The layer occupies <inline-formula> <tex-math>$70 \\times \\,70\\,\\mu $ </tex-math></inline-formula>m while supporting multibit inputs (8-bit) and weights (8-bit) with signed accumulation up to 18 bits. 
The power dissipation of the computation layer is 576 <inline-formula> <tex-math>$\\mu $ </tex-math></inline-formula>W at 0.72-V supply and 500-MHz clock frequency achieving an average area efficiency of 24.74 GOPS/mm² (up to 544.22 GOPS/mm²), an average energy efficiency of 0.21 TOPS/W (up to 4.63 TOPS/W), and a normalized energy efficiency of 13.46 1b-TOPS/W (up to 296.30 1b-TOPS/W).","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 12","pages":"2220-2231"},"PeriodicalIF":2.8,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}