{"title":"A 36-Gb/s 2× Half-Baud-Rate Adaptive Receiver in 28-nm CMOS","authors":"Yi-Hao Lan;Shen-Iuan Liu","doi":"10.1109/TVLSI.2024.3392680","DOIUrl":"10.1109/TVLSI.2024.3392680","url":null,"abstract":"A 36-Gb/s 2× half-baud-rate (THBR) adaptive receiver (RX) is presented. The pattern-based adaptation method for adjusting the frequency response of a continuous-time linear equalizer (CTLE) is proposed. In addition, the reference voltage of the comparators is adapted to enhance the timing margin of the recovered clock in the RX. This THBR adaptive RX is fabricated in TSMC 28-nm CMOS technology with a core area of 0.097 mm². The measured bit error rate (BER) is less than 10⁻¹² for a 36-Gb/s pseudorandom binary sequence (PRBS) of 2⁷ − 1, when the channel loss is 19 dB at 18 GHz. The total power consumption of this RX is 76 mW with gated adaptation circuits. The calculated figure of merit (FoM) is 2.1 pJ/bit.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140833635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient 1.4-GS/s 10-bit Timing-Skew-Free Time-Interleaved SAR ADC With a Centralized Sampling Frontend","authors":"Siji Huang;Debajit Basak;Yanhang Chen;Qifeng Huang;Yifei Fan;Jie Yuan","doi":"10.1109/TVLSI.2024.3392611","DOIUrl":"10.1109/TVLSI.2024.3392611","url":null,"abstract":"This article presents a timing-skew-free time-interleaved (TI) successive-approximation register (SAR) analog-to-digital converter (ADC). By implementing an architecture with a single sample-and-hold (S/H) network, this design eliminates the need for a costly timing-skew calibration. Additionally, compared to architectures that utilize multiple S/H networks, it offers hardware and power savings. As a result, the proposed design is efficient in terms of energy and area, making it suitable for applications that require multiple ADC channels. A prototype ADC is designed and fabricated in a 28-nm CMOS process. The TI SAR ADC, running at 1.4 GS/s, achieves a signal-to-noise-and-distortion ratio (SNDR) and spurious free dynamic range (SFDR) of 48.1 and 58.4 dB with a Nyquist input, respectively. It dissipates 24 mW, leading to a Walden figure-of-merit (FoM) of 82.4 fJ/conv.-step. The chip occupies an active area of 0.06 mm².","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140833647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amoeba: An Efficient and Flexible FPGA-Based Accelerator for Arbitrary-Kernel CNNs","authors":"Xiao Wu;Miaoxin Wang;Jun Lin;Zhongfeng Wang","doi":"10.1109/TVLSI.2024.3383871","DOIUrl":"10.1109/TVLSI.2024.3383871","url":null,"abstract":"Inspired by the key operation of vision transformers (ViTs), convolutional neural networks (CNNs) have widely adopted arbitrary-kernel convolutions to achieve high performance in diverse vision-based tasks. However, existing hardware efforts primarily focus on implementing CNN models that consist of a stack of small kernels, which poses challenges in supporting large-kernel convolutions. To address this limitation, we propose Amoeba, a flexible field-programmable gate array (FPGA)-based inference accelerator designed for efficiently supporting CNNs with arbitrary kernel sizes. Specifically, we present an optimized dataflow approach in collaboration with the Z-flow method and kernel-segmentation (Kseg) scheme, which enables flexible support for arbitrary-kernel convolutions without sacrificing efficiency. Additionally, we incorporate vertical-fused (VF) and horizontal-fused (HF) methods into the layer execution schedule to optimize the computation and data transfer process. To further enhance the CNN deployment performance, we employ the loop tiling scheme search (LTSS) method, guided by a fine-grained performance model, during the early design phase. The proposed Amoeba accelerator is evaluated on Intel Arria 10 SoC FPGA. The experimental results demonstrate excellent performance on prevalent and emerging CNNs, achieving a throughput of up to 286.2 GOPs. 
Notably, Amoeba achieves 4.36× better DSP efficiency compared to prior works on the same network, highlighting its superior utilization of hardware resources for CNN inference tasks.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCAR: Power Side-Channel Analysis at RTL Level","authors":"Amisha Srivastava;Sanjay Das;Navnil Choudhury;Rafail Psiakis;Pedro Henrique Silva;Debjit Pal;Kanad Basu","doi":"10.1109/TVLSI.2024.3390601","DOIUrl":"10.1109/TVLSI.2024.3390601","url":null,"abstract":"Power side-channel (PSC) attacks exploit the dynamic power consumption of cryptographic operations to leak sensitive information about encryption hardware. Therefore, it is necessary to conduct a PSC analysis to assess the susceptibility of cryptographic systems and mitigate potential risks. Existing PSC analysis primarily focuses on postsilicon implementations, which are inflexible in addressing design flaws, leading to costly and time-consuming postfabrication design re-spins. Hence, presilicon PSC analysis is required for the early detection of vulnerabilities to improve design robustness. In this article, we introduce SCAR, a novel presilicon PSC analysis framework based on graph neural networks (GNNs). SCAR converts register-transfer level (RTL) designs of encryption hardware into control-data flow graphs (CDFGs) and use that to detect the design modules susceptible to side-channel leakage. Furthermore, we incorporate a deep-learning-based explainer in SCAR to generate quantifiable and human-accessible explanations of our detection and localization decisions. We have also developed a fortification component as a part of SCAR that uses large-language models (LLMs) to automatically generate and insert additional design code at the localized zone to shore up the side-channel leakage. When evaluated on popular encryption algorithms like advanced encryption standard (AES), RSA, and PRESENT, and postquantum cryptography (PQC) algorithms like Saber and CRYSTALS-Kyber, SCAR, achieves up to 94.49% localization accuracy, 100% precision, and 90.48% recall. Additionally, through explainability analysis, SCAR reduces features for GNN model training by 57% while maintaining comparable accuracy. 
We believe that SCAR will transform the security-critical hardware design cycle, resulting in faster design closure at a reduced design cost.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Another Look at Side-Channel-Resistant Encoding Schemes","authors":"Xiaolu Hou;Jakub Breier;Mladen Kovačević","doi":"10.1109/TVLSI.2024.3390614","DOIUrl":"10.1109/TVLSI.2024.3390614","url":null,"abstract":"The idea of balancing the side-channel leakage in software was proposed more than a decade ago. Just like with other hiding-based countermeasures, the goal is not to hide the leakage completely but to significantly increase the effort required for the attack. Previous approaches focused on two directions: either balancing the Hamming weight of the processed data or deriving the code by using stochastic leakage profiling. In this brief, we build upon these results by proposing a novel approach that combines the two directions. We provide the theory behind our encoding scheme backed by experimental results on a 32-bit ARM Cortex-M4 microcontroller. Our results show that such a combination gives better side-channel resistance properties than each of the two methods separately.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information","authors":"","doi":"10.1109/TVLSI.2024.3380313","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3380313","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10508546","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140647924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information","authors":"","doi":"10.1109/TVLSI.2024.3380315","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3380315","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10508548","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140648023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual-Rail Precharge Logic-Based Side-Channel Countermeasure for DNN Systolic Array","authors":"Le Wu;Liji Wu;Xiangmin Zhang;Munkhbaatar Chinbat","doi":"10.1109/TVLSI.2024.3387986","DOIUrl":"10.1109/TVLSI.2024.3387986","url":null,"abstract":"Deep neural network (DNN) accelerators are widely used in cloud-edge-end and other application scenarios. Researchers recently focused on extracting secret information from DNN through side-channel attacks (SCAs), which substantially threaten AI security. In this brief, we propose a high-security, high-performance side-channel countermeasure using dual-rail precharge logic (DPL) for the DNN systolic array. By collecting and analyzing 5000 power traces, our proposed DPL-based systolic array provides a significantly lower correlation coefficient of 0.045. Through system-level side-channel security evaluation on field-programmable gate arrays (FPGAs), the DPL-based systolic array can effectively defend against weight extraction under power SCAs.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Complexity VLSI Architecture for OTFS Transceiver Under Multipath Fading Channel","authors":"Ashish Ranjan Shadangi;Suvra Sekhar Das;Indrajit Chakrabarti","doi":"10.1109/TVLSI.2024.3384114","DOIUrl":"10.1109/TVLSI.2024.3384114","url":null,"abstract":"Orthogonal time frequency space (OTFS) modulation has established itself as a dependable protocol for high-speed vehicular communication. This pioneering technique operates within a novel 2-D delay-Doppler domain waveform. When compared with conventional modulation methods like orthogonal frequency-division multiplexing (OFDM), OTFS demonstrates superior performance enhancements in scenarios involving rapidly moving wireless channels. This article begins by initially unveiling the input–output association of the OTFS signal within the delay-time domain. A comprehensive comparison with the established OFDM waveform highlights the potential of OTFS for achieving a notably lower bit error rate (BER) under various conditions, which has been obtained by using the minimum mean square equalizer (MMSE) equalization technique. Finally, we have proposed a novel and low-complexity VLSI architecture for the OTFS transmitter and the receiver by using the lower–upper (LU) decomposition technique for the first time in the literature. We have compared the performance metrics of our proposed transmitter architecture with the existing work, where our design works 7.394% faster than others, utilizing 89.354% less in the number of lookup tables (LUTs) and 79.984% less in the number of flip-flops (FFs), which shows that our design is more optimized in latency and resource utilization. 
There is no architecture design of the OTFS receiver part in the existing literature to compare; we have shown the resource utilization of our proposed receiver architecture for the first time in the literature, followed by timing analysis and functionality testing of the proposed architecture.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Efficient Retraining: A Large-Scale Approximate Neural Network Framework With Cross-Layer Optimization","authors":"Tianyang Yu;Bi Wu;Ke Chen;Chenggang Yan;Weiqiang Liu","doi":"10.1109/TVLSI.2024.3386900","DOIUrl":"10.1109/TVLSI.2024.3386900","url":null,"abstract":"Leveraging approximate multipliers in approximate neural networks (ApproxNNs) can effectively reduce hardware area and power consumption, making them suitable for edge-side applications. However, the propagation of layer-by-layer errors limits the application of approximate multipliers to large-scale ApproxNNs and complex tasks. Currently, retraining techniques that consider approximate multiplication errors are commonly used to compensate for the accuracy loss. However, due to the irregularity of the errors introduced by approximate multiplier, it is difficult for the existing generic acceleration hardware (e.g., GPU) to efficiently simulate its function and accelerate retraining, which thereby leads to a huge retraining overhead in ApproxNNs’ application. In this article, we propose an ApproxNN framework that introduces errors with regular and controlled positions for high-efficiency retraining of large-scale ApproxNNs. An approximate multiplier design that matches this framework is also presented to verify the effectiveness of the proposed ApproxNN framework. Experiment results demonstrate that the proposed ApproxNN framework is able to achieve up to 46× speedup in retraining, and the proposed approximate multiplier reduces area/power-delay product (PDP) by 31%/63% compared to the exact multiplier. 
Compared with the floating-point neural network (NN) model, an accuracy decrease of only 1.13% is achieved when applied to ResNet50 on ImageNet dataset with only 15-epochs retraining, which surpasses other state-of-the-art designs.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140611051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}