{"title":"P2-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer","authors":"Huihong Shi;Xin Cheng;Wendong Mao;Zhongfeng Wang","doi":"10.1109/TVLSI.2024.3422684","DOIUrl":"10.1109/TVLSI.2024.3422684","url":null,"abstract":"Vision transformers (ViTs) have excelled in computer vision (CV) tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield nonnegligible requantization overhead, limiting ViTs’ hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose P2-ViT, the first power-of-two (PoT) posttraining quantization (PTQ) and acceleration framework to accelerate fully quantized ViTs. Specifically, as for quantization, we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the requantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency tradeoffs. In terms of hardware, we develop a dedicated chunk-based accelerator featuring multiple tailored subprocessors to individually handle ViTs’ different types of operations, alleviating reconfigurable overhead. In addition, we design a tailored row-stationary dataflow to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P2-ViT’s effectiveness. Particularly, we offer comparable or even superior quantization performance with PoT scaling factors when compared with the counterpart with floating-point scaling factors. 
Besides, we achieve up to 10.1× speedup and 36.8× energy saving over GPU’s Turing Tensor Cores, and up to 1.84× higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Codes are available at https://github.com/shihuihong214/P2-ViT.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141612041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FELIX: FPGA-Based Scalable and Lightweight Accelerator for Large Integer Extended GCD","authors":"Samuel Coulon;Tianyou Bao;Jiafeng Xie","doi":"10.1109/TVLSI.2024.3417016","DOIUrl":"10.1109/TVLSI.2024.3417016","url":null,"abstract":"The extended greatest common divisor (XGCD) computation is a critical component in various cryptographic applications and algorithms, including both pre- and postquantum cryptosystems. In addition to computing the greatest common divisor (GCD) of two integers, the XGCD also produces Bézout coefficients $b_{a}$ and $b_{b}$ which satisfy $\\mathrm{GCD}(a,b) = a \\times b_{a} + b \\times b_{b}$. In particular, computing the XGCD for large integers is of significant interest. Most recently, XGCD computation between 6479-bit integers is required for solving Nth-degree truncated polynomial ring unit (NTRU) trapdoors in Falcon, a National Institute of Standards and Technology (NIST)-selected postquantum digital signature scheme. To this point, existing literature has primarily focused on exploring software-based implementations for XGCD. The few existing high-performance hardware architectures require significant hardware resources and may not be desirable for practical usage, and the lightweight architectures suffer from poor performance. To fill the research gap, this work proposes a novel FPGA-based scalable and lightweight accelerator for large integer XGCD (FELIX). First, a new algorithm suitable for scalable and lightweight computation of XGCD is proposed. Next, a hardware accelerator (FELIX) is presented, including both constant- and variable-time versions. Finally, a thorough evaluation is carried out to showcase the efficiency of the proposed FELIX.
In certain configurations, FELIX involves 81% less equivalent area-time product (eATP) than the state-of-the-art design for 1024-bit integers, and achieves a 95% reduction in latency over the software for 6479-bit integers (Falcon parameter set) with reasonable resource usage. Overall, the proposed FELIX is highly efficient, scalable, lightweight, and suitable for very large integer computation, making it the first such XGCD accelerator in the literature (to the best of our knowledge).","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10593812","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141585201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Low-Power Co-Processor to Predict Ventricular Arrhythmia for Wearable Healthcare Devices","authors":"Meenali Janveja;Rushik Parmar;Srichandan Dash;Jan Pidanic;Gaurav Trivedi","doi":"10.1109/TVLSI.2024.3413584","DOIUrl":"10.1109/TVLSI.2024.3413584","url":null,"abstract":"Ventricular arrhythmia (VA) is the most critical cardiac anomaly among all arrhythmia beats. Thus, it becomes imperative to predict the occurrence of VA to avoid sudden casualties caused by these arrhythmia beats. In the past, only a few hardware designs have been proposed to predict VA using various features derived from electrocardiogram (ECG) signals and processed using machine learning classifiers. However, these designs are either complex or need more prediction accuracy. Therefore, a deep neural network (DNN)-based co-processor for arrhythmia prediction is proposed in this article. It can predict VA at least 15 min before its occurrence with 91.6% accuracy. Co-processor architecture for arrhythmia prediction (CoAP) uses an optimal feature vector extracted from the ECG signal and an optimized DNN, using a novel approximate multiplier (AM). CoAP operates at 12.5 kHz and consumes 4.69 μW when implemented using SCL 180-nm bulk CMOS technology.
The low power realization of the proposed design and its higher accuracy, compared with well-known state-of-the-art methods, make it suitable for wearable devices.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141568490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Area-Efficient Systolic Array Redundancy Architecture for Reliable AI Accelerator","authors":"Hayoung Lee, Jongho Park, Sungho Kang","doi":"10.1109/tvlsi.2024.3421563","DOIUrl":"https://doi.org/10.1109/tvlsi.2024.3421563","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141568492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secure Edge-Coded Signaling IoT Transceiver With Reduced Encryption Overhead","authors":"Mizan Abraha Gebremicheal;Ibrahim M. Elfadel","doi":"10.1109/TVLSI.2024.3418713","DOIUrl":"10.1109/TVLSI.2024.3418713","url":null,"abstract":"The edge-coded signaling (ECS) protocol enables single-wire signaling in IoT devices and sensors using two important neuromorphic attributes. The first is the coding of bits as a stream of pulses (spikes), and the second is the circumvention of clock and data recovery (CDR) at the receiver. In addition, ECS can be endowed with strong, yet lightweight, security features using an ultralow-latency version of the A5/1 stream cipher. Such strong security comes at the expense of decreased data rates and significant area overhead. In this article, we introduce a new generation of secure ECS protocols that incorporates two notable improvements. The first is a more compact pulse stream definition that results in improved data rates for the plain ECS protocol. The second is a coding-aware version of the low-latency A5/1 stream cipher that results in minimal impact on the effective data rate of the transmission. Consequently, a new all-digital and secure ECS transceiver design is proposed, prototyped, and functionally verified in 65-nm technology. Compared with previous generations of secure ECS transceivers, this new design achieves an increase of approximately 138%, 199%, and 640% in minimum, average, and maximum data rates, respectively, and results in increased resiliency against brute-force attacks by a factor of 16. Furthermore, the ASIC implementation shows that it maintains the compact and energy-efficient features of the ECS architecture, using only 28 μW with an average energy efficiency of 2.745 pJ/bit and a gate count of approximately 2880 gates.
This is more than 40% decrease in the equivalent gate count relative to the previous secure ECS generation.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141568491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A μ-GA Oriented ANN-Driven: Parameter Extraction of 5G CMOS Power Amplifier","authors":"Tahesin Samira Delwar;Abrar Siddique;Unal Aras;Yangwon Lee;Jee Youl Ryu","doi":"10.1109/TVLSI.2024.3414584","DOIUrl":"10.1109/TVLSI.2024.3414584","url":null,"abstract":"This article introduces a novel method for extracting crucial parameters from a fifth-generation (5G) CMOS power amplifier (PA) operating at 24 GHz. The proposed method, micro-genetic algorithm artificial neural network (μ-GAANN), presents an innovative synergy between μ-GA and ANN, enabling the accurate determination of crucial PA (circuit components) parameters. The μ-GAANN model has a fixed and robust stimulation function ($F_{\\text{SF}}$ and $R_{\\text{SF}}$). ANNs are trained to approximate the parameter extraction process based on input-output data generated from the μ-GA. The proposed μ-GA incorporates the arithmetic crossover and nonuniform mutation; thus, several parameters of the ANN network are tweaked. Moreover, ANN parameters are enhanced by using μ-GA to achieve an optimal PA design in a shorter period of time. To verify the proposed μ-GAANN, we have also compared the training time with particle swarm optimization (PSO) employed in ANN, i.e., PSOANN.
Besides, a derivative superposition (DS) linearization technique is used in the PA circuit, along with input load splits (I-LSs) to solve the low input impedance problem of conventional DS. To design a PA, the proposed μ-GAANN outperforms the traditional feedforward artificial neural networks (TFFANN). Using μ-GAANN, the PA’s simulated S21 is 25 dB, while the measured S21 is 21.2 dB. With traditional TFFANN, we observe a simulated gain of 24.1 dB for the PA. We achieved a simulated gain of 23.2 dB of the PA without using ANNs. The measured results of the $P_{\\text{sat}}$ and PAE of the PA with μ-GAANN are 9.8 dBm and 32.1%, respectively. Also, a measured PA achieves a high third-order-input-intercept point (IIP3) of 14.1 dBm. The core chip area of the PA is 0.35 mm².","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141519144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}