{"title":"FasDL: An Efficient Serverless-Based Training Architecture With Communication Optimization and Resource Configuration","authors":"Xinglei Chen;Zinuo Cai;Hanwen Zhang;Ruhui Ma;Rajkumar Buyya","doi":"10.1109/TC.2024.3485202","DOIUrl":"https://doi.org/10.1109/TC.2024.3485202","url":null,"abstract":"Deploying distributed training workloads of deep learning models atop serverless architecture alleviates the burden of managing servers from deep learning practitioners. However, when supporting deep model training, the current serverless architecture faces the challenges of inefficient communication patterns and rigid resource configuration that incur subpar and unpredictable training performance. In this paper, we propose <bold>FasDL</b>, an efficient serverless-based deep learning training architecture to solve these two challenges. <bold>FasDL</b> adopts a novel training framework <monospace>K-REDUCE</monospace> to release the communication overhead and accelerate the training. Additionally, FasDL builds a lightweight mathematical model for <monospace>K-REDUCE</monospace> training, offering predictable performance and supporting subsequent resource configuration. It achieves the optimal resource configuration by formulating an optimization problem related to system-level and application-level parameters and solving it with a pruning-based heuristic search algorithm. 
Extensive experiments on AWS Lambda verify a prediction accuracy of over 94% and demonstrate performance and cost advantages over the state-of-the-art architecture LambdaML by up to 16.8% and 28.3% respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"468-482"},"PeriodicalIF":3.6,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
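The resource-configuration step described above — searching system-level and application-level parameters under a predictive performance model, with pruning — might be sketched generically as follows. All names and cost/time models here are hypothetical placeholders, not FasDL's actual formulation:

```python
def configure(candidates, predict_time, predict_cost, deadline):
    """Pick the cheapest (workers, memory) config whose predicted training
    time meets the deadline, pruning configs that cannot beat the incumbent."""
    best, best_cost = None, float("inf")
    for workers, memory in sorted(candidates):
        cost = predict_cost(workers, memory)
        if cost >= best_cost:
            continue  # prune: cannot improve on the best feasible config so far
        if predict_time(workers, memory) <= deadline:
            best, best_cost = (workers, memory), cost
    return best
```

With a toy model such as `predict_time = 100 / workers + 10` and `predict_cost = workers * memory`, the search returns the smallest configuration that still meets the deadline.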
{"title":"Response-Hiding and Volume-Hiding Verifiable Searchable Encryption With Conjunctive Keyword Search","authors":"Jiguo Li;Licheng Ji;Yicheng Zhang;Yang Lu;Jianting Ning","doi":"10.1109/TC.2024.3485172","DOIUrl":"https://doi.org/10.1109/TC.2024.3485172","url":null,"abstract":"Verifiable searchable encryption (VSE) not only allows the client to search encrypted data, but also allows the client to verify whether the server honestly executes search operations. Currently, VSE scheme has been widely studied in cloud storage. However, most existing VSE schemes did not hide the access pattern and volume pattern, which respectively refer to the document identifiers and the number of documents matching the queried keywords. Recent studies have exploited these two patterns to launch attacks on searchable encryption schemes, resulting in compromising the confidentiality of encrypted data and queried keywords. In order to solve above issues, we utilize additively symmetric homomorphic encryption scheme and private set intersection protocol to construct a VSE scheme that supports conjunctive keyword search and hides the access pattern and volume pattern (i.e., response-hiding and volume-hiding). Our security model assumes that the server is malicious in the sense that it might deliberately carry out incorrect search operations. Formal security analysis demonstrates that our scheme achieves the desired security properties under our leakage function. Compared to previous schemes, our scheme has advantages in terms of performance and functionality. 
In an experimental setup with a security parameter of 128 bits and <inline-formula><tex-math>$2^{23}$</tex-math></inline-formula> keyword/document pairs, the search time is approximately 7.18 seconds.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"455-467"},"PeriodicalIF":3.6,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems","authors":"Mengke Ge;Junpeng Wang;Binhan Chen;Yingjian Zhong;Haitao Du;Song Chen;Yi Kang","doi":"10.1109/TC.2024.3483633","DOIUrl":"https://doi.org/10.1109/TC.2024.3483633","url":null,"abstract":"The advent of Transformers has revolutionized computer vision, offering a powerful alternative to convolutional neural networks (CNNs), especially with the local attention mechanism that excels at capturing local structures within the input and achieve state-of-the-art performance. Processing in-memory (PIM) architecture offers extensive parallelism, low data movement costs, and scalable memory bandwidth, making it a promising solution to accelerate Transformer with memory-intensive operations. However, the crucial issue lies in efficiently deploying an entire model onto resource-limited PIM system while parallelizing each transformer block with potentially many computational branches based on local-attention mechanisms. We present Allspark, which focuses on workload orchestration for visual Transformers on PIM systems, aiming at minimizing inference latency. Firstly, to fully utilize the massive parallelism of PIM, Allspark employs a fine-grained partitioning scheme for computational branches, and formats a systematic layout and interleaved dataflow with maximized data locality and reduced data movement. Secondly, Allspark formulates the scheduling of the complete model on a resource-limited distributed PIM system as an integer linear programming (ILP) problem. Thirdly, as local-global data interactions exhibit complex yet regular dependencies, Allspark provides a two-stage placement method, which simplifies the challenging placement of computational branches on the PIM system into the structured layout and greedy-based binding, to minimize NoC communication costs. 
Extensive experiments on 3D-stacked DRAM-based PIM systems show that Allspark brings <inline-formula><tex-math>$1.2\times$</tex-math></inline-formula> <inline-formula><tex-math>$\sim$</tex-math></inline-formula> <inline-formula><tex-math>$24.0\times$</tex-math></inline-formula> inference speedup for various visual Transformers over baselines. Compared to Nvidia V100 GPU, Allspark-enriched PIM system yields average speedups of <inline-formula><tex-math>$2.3\times$</tex-math></inline-formula> and energy savings of <inline-formula><tex-math>$20\times$</tex-math></inline-formula> <inline-formula><tex-math>$\sim$</tex-math></inline-formula> <inline-formula><tex-math>$55\times$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"427-441"},"PeriodicalIF":3.6,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bit-Sparsity Aware Acceleration With Compact CSD Code on Generic Matrix Multiplication","authors":"Zixuan Zhu;Xiaolong Zhou;Chundong Wang;Li Tian;Zunkai Huang;Yongxin Zhu","doi":"10.1109/TC.2024.3483632","DOIUrl":"https://doi.org/10.1109/TC.2024.3483632","url":null,"abstract":"The ever-increasing demand for matrix multiplication in artificial intelligence (AI) and generic computing emphasizes the necessity of efficient computing power accommodating both floating-point (FP) and quantized integer (QINT). While state-of-the-art bit-sparsity-aware acceleration techniques have demonstrated impressive performance and efficiency in neural networks through software-driven methods such as pruning and quantization, these approaches are not always feasible in typical generic computing scenarios. In this paper, we propose Bit-Cigma, a hardware-centric architecture that leverages bit-sparsity to accelerate generic matrix multiplication. Bit-Cigma features (1) CCSD encoding, an optimized on-chip sparsification technique based on canonical signed digit (CSD) representation; (2) segmented dot product, a multi-stage exponent matching technique for long FP vectors; and (3) the versatility to efficiently process both FP and QINT data types. CCSD encoding halves the cost of CSD encoding while achieving optimal bit-sparsity, and segmented dot product improves both accuracy and throughput. Bit-Cigma cores are implemented using 65 nm technology at 1 GHz, demonstrating substantial gains in performance and efficiency for both FP and QINT configurations. 
Compared to state-of-the-art Bitlet, Bit-Cigma achieves 3.2<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula> performance, 6.1<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula> area efficiency, and 15.3<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula> energy efficiency when processing FP32 data while ensuring zero computing error.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"414-426"},"PeriodicalIF":3.6,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
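Canonical signed digit (CSD) recoding, the representation underlying Bit-Cigma's CCSD encoding, rewrites an integer with digits in {-1, 0, 1} so that no two adjacent digits are nonzero, which maximizes bit-sparsity. A minimal sketch of the classic recoding (illustrative only, not the paper's optimized CCSD hardware scheme):

```python
def to_csd(n):
    """Canonical signed-digit recoding of a non-negative integer.

    Returns digits in {-1, 0, 1}, least-significant first, such that
    sum(d * 2**i for i, d in enumerate(digits)) == n and no two adjacent
    digits are nonzero (so the nonzero count never exceeds binary's).
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d           # makes the next bit even, forcing a 0 digit
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For example, 7 = 0b111 (three nonzero bits) recodes as 8 - 1, i.e. digits `[-1, 0, 0, 1]` with only two nonzero digits, which is what a bit-serial multiplier exploits.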
{"title":"A Multi-UAV Cooperative Task Scheduling in Dynamic Environments: Throughput Maximization","authors":"Liang Zhao;Shuo Li;Zhiyuan Tan;Ammar Hawbani;Stelios Timotheou;Keping Yu","doi":"10.1109/TC.2024.3483636","DOIUrl":"https://doi.org/10.1109/TC.2024.3483636","url":null,"abstract":"Unmanned aerial vehicle (UAV) has been considered a promising technology for advancing terrestrial mobile computing in the dynamic environment. In this research field, throughput, the number of completed tasks and latency are critical evaluation indicators used to measure the efficiency of UAVs in existing studies. In this paper, we transform these metrics to a single optimization objective, i.e., throughput maximization. To maximize the throughput, we consider realizing this goal in two respects. The first is to adapt the formation of the UAVs to provide cooperative computing service in a dynamic environment, we integrate a policy-based gradient algorithm and the task factorization network as a new reinforcement learning algorithm to improve the cooperation of UAVs. The second is to optimize the association process between UAVs and users, where the heterogeneity of tasks is considered. This algorithm is modified from the Gale-Shapley stability concept to optimize the appropriate association between tasks and UAVs in a dynamic time-varying condition to get the near-optimal association with few iterations. The scheduling of dependent tasks and independent tasks jointly also has to be considered. 
Finally, simulation results demonstrate the improved cooperation performance and the practicality of the association process.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"442-454"},"PeriodicalIF":3.6,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
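The task-UAV association above builds on the Gale-Shapley stability concept. The classic deferred-acceptance procedure it modifies can be sketched as a one-to-one toy version (illustrative only, not the paper's dynamic, capacity-aware variant):

```python
def gale_shapley(task_prefs, uav_prefs):
    """Deferred acceptance: tasks propose to UAVs in preference order;
    each UAV tentatively holds its best proposer so far. Returns {uav: task}.
    Assumes complete preference lists and equal numbers of tasks and UAVs."""
    rank = {u: {t: i for i, t in enumerate(p)} for u, p in uav_prefs.items()}
    next_choice = {t: 0 for t in task_prefs}  # next UAV each task will try
    free = list(task_prefs)
    match = {}                                # uav -> currently held task
    while free:
        task = free.pop()
        uav = task_prefs[task][next_choice[task]]
        next_choice[task] += 1
        if uav not in match:
            match[uav] = task
        elif rank[uav][task] < rank[uav][match[uav]]:
            free.append(match[uav])           # bump the lower-ranked task
            match[uav] = task
        else:
            free.append(task)                 # rejected; will try its next UAV
    return match
```

The resulting matching is stable: no task and UAV both prefer each other over their assigned partners, which is the property the paper's association process preserves under time-varying conditions.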
{"title":"ECO-CRYSTALS: Efficient Cryptography CRYSTALS on Standard RISC-V ISA","authors":"Xinyi Ji;Jiankuo Dong;Junhao Huang;Zhijian Yuan;Wangchen Dai;Fu Xiao;Jingqiang Lin","doi":"10.1109/TC.2024.3483631","DOIUrl":"https://doi.org/10.1109/TC.2024.3483631","url":null,"abstract":"The field of post-quantum cryptography (PQC) is continuously evolving. Many researchers are exploring efficient PQC implementation on various platforms, including x86, ARM, FPGA, GPU, etc. In this paper, we present an Efficient CryptOgraphy CRYSTALS (ECO-CRYSTALS) implementation on standard 64-bit RISC-V Instruction Set Architecture (ISA). The target schemes are two winners of the National Institute of Standards and Technology (NIST) PQC competition: CRYSTALS-Kyber and CRYSTALS-Dilithium, where the two most time-consuming operations are Keccak and polynomial multiplication. Notably, this paper is the first highly-optimized assembly software implementation to deploy Kyber and Dilithium on the 64-bit RISC-V ISA. Firstly, we propose a better scheduling strategy for Keccak, which is specifically tailored for the 64-bit dual-issue RISC-V architecture. Our 24-round Keccak permutation (Keccak-<inline-formula><tex-math>$p$</tex-math></inline-formula>[1600,24]) achieves a 59.18% speed-up compared to the reference implementation. Secondly, we apply two modular arithmetic (Montgomery arithmetic and Plantard arithmetic) in the polynomial multiplication of Kyber and Dilithium to get a better lazy reduction. Then, we propose a flexible dual-instruction-issue scheme of Number Theoretic Transform (NTT). As for the matrix-vector multiplication, we introduce a row-to-column processing methodology to minimize the expensive memory access operations. Compared to the reference implementation, we obtain a speedup of 53.85%<inline-formula><tex-math>$thicksim$</tex-math></inline-formula>85.57% for NTT, matrix-vector multiplication, and INTT in our ECO-CRYSTALS. 
Finally, the ECO-CRYSTALS implementation for key generation, encapsulation, and decapsulation in Kyber achieves 399k, 448k, and 479k cycles respectively, achieving speedups of 60.82%, 63.93%, and 65.56% compared to the NIST reference implementation. Similarly, the ECO-CRYSTALS implementation for key generation, sign, and verify in Dilithium reaches 1,364k, 3,191k, and 1,369k cycles, showcasing speedups of 54.84%, 64.98%, and 57.20%, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"401-413"},"PeriodicalIF":3.6,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
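Montgomery arithmetic, one of the two reduction techniques the paper applies, can be illustrated for the Kyber modulus q = 3329 with R = 2^16. This is a plain-Python sketch of the well-known reduction (not the paper's RISC-V assembly implementation):

```python
Q = 3329                    # Kyber modulus
QINV = pow(Q, -1, 1 << 16)  # q^{-1} mod 2^16 (= 62209 in the Kyber reference code)

def montgomery_reduce(a):
    """Return r with r * 2**16 congruent to a (mod Q), for |a| < Q * 2**15.

    Only two multiplications and a shift; no division by Q. The low 16 bits
    of (a - t * Q) are zero by construction, so the shift is an exact divide.
    """
    t = (a * QINV) & 0xFFFF  # a * q^{-1} mod R, with R = 2^16
    return (a - t * Q) >> 16
```

Keeping intermediate NTT values in this Montgomery domain is what enables the lazy-reduction strategies the abstract mentions: reductions can be deferred across several butterfly operations before the accumulated value overflows.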
{"title":"Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators","authors":"Arne Symons;Linyan Mei;Steven Colleman;Pouya Houshmand;Sebastian Karl;Marian Verhelst","doi":"10.1109/TC.2024.3477938","DOIUrl":"https://doi.org/10.1109/TC.2024.3477938","url":null,"abstract":"As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far, these systems exploit the increased parallelism by coarsely mapping a single layer at a time across cores, which incurs frequent costly off-chip memory accesses, or by pipelining batches of inputs, which falls short in meeting the demands of latency-critical applications. To alleviate these bottlenecks, this work explores a new fine-grain mapping paradigm, referred to as layer fusion, on heterogeneous dataflow accelerators through a novel design space exploration framework called \u0000<i>Stream</i>\u0000. \u0000<i>Stream</i>\u0000 captures a wide variety of heterogeneous dataflow architectures and mapping granularities, and implements a memory and communication-aware latency and energy analysis validated with three distinct state-of-the-art hardware implementations. As such, it facilitates a holistic exploration of architecture and mapping, by strategically allocating the workload through constraint optimization. The findings demonstrate that the integration of layer fusion with heterogeneous dataflow accelerators yields up to \u0000<inline-formula><tex-math>$2.2times$</tex-math></inline-formula>\u0000 lower energy-delay product in inference efficiency, addressing both energy consumption and latency concerns. 
The framework is available open-source at: github.com/kuleuven-micas/stream.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"237-249"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedQClip: Accelerating Federated Learning via Quantized Clipped SGD","authors":"Zhihao Qu;Ninghui Jia;Baoliu Ye;Shihong Hu;Song Guo","doi":"10.1109/TC.2024.3477972","DOIUrl":"https://doi.org/10.1109/TC.2024.3477972","url":null,"abstract":"Federated Learning (FL) has emerged as a promising technique for collaboratively training machine learning models among multiple participants while preserving privacy-sensitive data. However, the conventional parameter server architecture presents challenges in terms of communication overhead when employing iterative optimization methods such as Stochastic Gradient Descent (SGD). Although communication compression techniques can reduce the traffic cost of FL during each training round, they often lead to degraded convergence rates, mainly due to compression errors and data heterogeneity. To address these issues, this paper presents FedQClip, an innovative approach that combines quantization and Clipped SGD. FedQClip leverages an adaptive step size inversely proportional to the <inline-formula><tex-math>$ell_{2}$</tex-math></inline-formula> norm of the gradient, effectively mitigating the negative impacts of quantized errors. Additionally, clipped operations can be applied locally and globally to further expedite training. Theoretical analyses provide evidence that, even under the settings of Non-IID (non-independent and identically distributed) data, FedQClip achieves a convergence rate of <inline-formula><tex-math>$mathcal{O}(frac{1}{sqrt{T}})$</tex-math></inline-formula>, effectively addressing the convergence degradation caused by compression errors. Furthermore, our theoretical analysis highlights the importance of selecting an appropriate number of local updates to enhance the convergence of FL training. 
Through extensive experiments, we demonstrate that FedQClip outperforms state-of-the-art methods in terms of communication efficiency and convergence rate.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"717-730"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
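The two generic ingredients FedQClip combines — an update whose step size is inversely proportional to the gradient's l2 norm, and unbiased stochastic quantization for communication compression — can be sketched as follows. This is an illustrative composition of the standard building blocks, not the paper's exact update rule:

```python
import numpy as np

def stochastic_quantize(g, levels=4):
    """QSGD-style unbiased quantizer: snap |g|/||g|| onto a uniform grid of
    `levels` steps, rounding up with probability equal to the fractional part."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    scaled = np.abs(g) / norm * levels
    lower = np.floor(scaled)
    q = lower + (np.random.rand(*g.shape) < scaled - lower)
    return np.sign(g) * q * (norm / levels)

def clipped_step(w, grad, gamma):
    """Normalized (clipped) update: the step length is gamma regardless of
    gradient magnitude, i.e. an effective step size proportional to 1/||grad||."""
    norm = np.linalg.norm(grad)
    return w - gamma * grad / max(norm, 1e-12)
```

Because the clipped step bounds how far any single (possibly noisy, quantized) gradient can move the model, quantization error cannot blow up an individual update, which is the intuition behind the convergence guarantee quoted above.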
{"title":"A Deep Learning-Assisted Template Attack Against Dynamic Frequency Scaling Countermeasures","authors":"Davide Galli;Francesco Lattari;Matteo Matteucci;Davide Zoni","doi":"10.1109/TC.2024.3477997","DOIUrl":"https://doi.org/10.1109/TC.2024.3477997","url":null,"abstract":"In the last decades, machine learning techniques have been extensively used in place of classical template attacks to implement profiled side-channel analysis. This manuscript focuses on the application of machine learning to counteract Dynamic Frequency Scaling defenses. While state-of-the-art attacks have shown promising results against desynchronization countermeasures, a robust attack strategy has yet to be realized. Motivated by the simplicity and effectiveness of template attacks for devices lacking desynchronization countermeasures, this work presents a Deep Learning-assisted Template Attack (DLaTA) methodology specifically designed to target highly desynchronized traces through Dynamic Frequency Scaling. A deep learning-based pre-processing step recovers information obscured by desynchronization, followed by a template attack for key extraction. Specifically, we developed a three-stage deep learning pipeline to resynchronize traces to a uniform reference clock frequency. The experimental results on the AES cryptosystem executed on a RISC-V System-on-Chip reported a Guessing Entropy equal to 1 and a Guessing Distance greater than 0.25. Results demonstrate the method's ability to successfully retrieve secret keys even in the presence of high desynchronization. 
As an additional contribution, we publicly release our <monospace>DFS_DESYNCH</monospace> database<xref><sup>1</sup></xref><fn><label><sup>1</sup></label><p><uri>https://github.com/hardware-fab/DLaTA</uri></p></fn> containing the first set of real-world highly desynchronized power traces from the execution of a software AES cryptosystem.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"293-306"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10713265","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
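A classical template attack, the baseline that DLaTA extends, profiles a Gaussian model of the leakage per key hypothesis and classifies a fresh trace by maximum likelihood. A minimal diagonal-covariance sketch (illustrative only; the paper's pipeline adds deep-learning resynchronization before this step and uses full templates):

```python
import numpy as np

def build_templates(traces, labels):
    """One template per class: the pointwise mean and variance of its traces
    (a diagonal-covariance simplification of the classic template attack)."""
    return {c: (traces[labels == c].mean(axis=0),
                traces[labels == c].var(axis=0) + 1e-9)
            for c in set(labels.tolist())}

def classify(trace, templates):
    """Return the class whose Gaussian template assigns the trace the
    highest log-likelihood."""
    def loglik(mu, var):
        return -0.5 * np.sum(np.log(var) + (trace - mu) ** 2 / var)
    return max(templates, key=lambda c: loglik(*templates[c]))
```

In a real attack the "classes" are key-dependent intermediate values (e.g. S-box output Hamming weights) profiled on a clone device; desynchronized traces defeat this directly, which is why DLaTA resynchronizes first.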
{"title":"AsyncGBP${}^{+}$+: Bridging SSL/TLS and Heterogeneous Computing Power With GPU-Based Providers","authors":"Yi Bian;Fangyu Zheng;Yuewu Wang;Lingguang Lei;Yuan Ma;Tian Zhou;Jiankuo Dong;Guang Fan;Jiwu Jing","doi":"10.1109/TC.2024.3477987","DOIUrl":"https://doi.org/10.1109/TC.2024.3477987","url":null,"abstract":"The rapid evolution of GPUs has emerged as a promising solution for accelerating the worldwide used SSL/TLS, which faces performance bottlenecks due to its underlying heavy cryptographic computations. Nevertheless, substantial structural adjustments from the parallel mode of GPUs to the serial mode of the SSL/TLS stack are imperative, potentially constraining the practical deployment of GPUs. In this paper, we propose AsyncGBP<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>, a three-level framework that facilitates the seamless conversion of cryptographic requests from synchronous to asynchronous mode. We conduct an in-depth analysis of the OpenSSL provider and cryptographic primitive features relevant to GPU implementations, aiming to fully exploit the potential of GPUs. Notably, AsyncGBP<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula> supports three working settings (offline/online/hybrid), finely tailored for various public key cryptographic primitives, including traditional ones like X25519, Ed25519, ECDSA, and the quantum-safe CRYSTALS-Kyber. A comprehensive evaluation demonstrates that AsyncGBP<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula> can efficiently achieve an improvement of up to 137.8<inline-formula><tex-math>$times$</tex-math></inline-formula> compared to the default OpenSSL provider (for X25519, Ed25519, ECDSA) and 113.30<inline-formula><tex-math>$times$</tex-math></inline-formula> compared to OpenSSL-compatible <monospace>liboqs</monospace> (for CRYSTALS-Kyber) in a single-process setting. 
Furthermore, AsyncGBP<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula> surpasses the current fastest commercial-off-the-shelf OpenSSL-compatible TLS accelerator with a 5.3<inline-formula><tex-math>$\times$</tex-math></inline-formula> to 7.0<inline-formula><tex-math>$\times$</tex-math></inline-formula> performance improvement.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"356-370"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}