High-Performance Hardware Acceleration Architecture for Cross-Silo Federated Learning

IF 5.6 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems · Pub Date: 2024-06-13 · DOI: 10.1109/TPDS.2024.3413718

Junxue Zhang;Xiaodian Cheng;Liu Yang;Jinbin Hu;Han Tian;Kai Chen

Volume 35, Issue 8, pp. 1506-1523 · https://ieeexplore.ieee.org/document/10556815/

Citations: 0

Abstract

Cross-silo federated learning (FL) adopts various cryptographic operations to preserve data privacy, which introduces significant performance overhead. In this paper, we identify nine widely-used cryptographic operations and design an efficient hardware architecture to accelerate them. However, directly offloading them on hardware statically leads to (1) inadequate hardware acceleration due to the limited resources allocated to each operation; (2) insufficient resource utilization, since different operations are used at different times. To address these challenges, we propose FLASH, a high-performance hardware acceleration architecture for cross-silo FL systems. At its heart, FLASH extracts two basic operators (modular exponentiation and multiplication) behind the nine cryptographic operations and implements them as highly-performant engines to achieve adequate acceleration. Furthermore, it leverages a dataflow scheduling scheme to dynamically compose different cryptographic operations based on these basic engines to obtain sufficient resource utilization. We have implemented a fully-functional FLASH prototype with Xilinx VU13P FPGA and integrated it with FATE, the most widely-adopted cross-silo FL framework. Experimental results show that, for the nine cryptographic operations, FLASH achieves up to $14.0\times$ and $3.4\times$ acceleration over CPU and GPU, translating to up to $6.8\times$ and $2.0\times$ speedup for realistic FL applications, respectively. We finally evaluate the FLASH design as an ASIC, and it achieves $23.6\times$ performance improvement upon the FPGA prototype.
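To illustrate how cryptographic operations can reduce to the two basic operators named in the abstract, the sketch below composes a toy Paillier cryptosystem from only a modular multiplication and a modular exponentiation routine. This is an assumption-laden illustration, not the FLASH design: the abstract does not list the nine operations, so Paillier is chosen here only as a familiar example of additively homomorphic encryption; the function names (`mod_mul`, `mod_exp`, `keygen`, and so on) are hypothetical, and the tiny primes are for readability only and offer no security.

```python
import math
import secrets

def mod_mul(a, b, n):
    """Basic operator 1: modular multiplication."""
    return (a * b) % n

def mod_exp(base, exp, n):
    """Basic operator 2: modular exponentiation (square-and-multiply),
    built only from mod_mul."""
    result = 1
    base %= n
    while exp > 0:
        if exp & 1:
            result = mod_mul(result, base, n)
        base = mod_mul(base, base, n)
        exp >>= 1
    return result

def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse of lambda mod n; valid because g = n + 1
    return (n, n * n), (lam, mu)

def encrypt(pub, m):
    """c = g^m * r^n mod n^2, using one mod_mul and two mod_exp calls."""
    n, n2 = pub
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return mod_mul(mod_exp(n + 1, m, n2), mod_exp(r, n, n2), n2)

def decrypt(pub, priv, c):
    """m = L(c^lambda mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    n, n2 = pub
    lam, mu = priv
    u = mod_exp(c, lam, n2)
    return mod_mul((u - 1) // n, mu, n)

def add_cipher(pub, c1, c2):
    """Homomorphic addition of plaintexts is a single mod_mul."""
    return mod_mul(c1, c2, pub[1])

def scalar_mul(pub, c, k):
    """Multiplying the plaintext by a scalar k is a single mod_exp."""
    return mod_exp(c, k, pub[1])

if __name__ == "__main__":
    pub, priv = keygen(1009, 1013)  # toy primes, insecure
    total = add_cipher(pub, encrypt(pub, 20), encrypt(pub, 22))
    print(decrypt(pub, priv, total))
    print(decrypt(pub, priv, scalar_mul(pub, encrypt(pub, 7), 6)))
```

In this sketch, encryption, decryption, ciphertext addition, and scalar multiplication all bottom out in the same two routines, which is the kind of sharing that would let a small number of fast engines serve several higher-level operations.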


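Source journal

IEEE Transactions on Parallel and Distributed Systems (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 11.00

Self-citation rate: 9.40%

Articles published: 281

Review time: 5.6 months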

Journal description: IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:

a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing.

b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.

c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.

d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.