{"title":"FedCSpc: A Cross-Silo Federated Learning System With Error-Bounded Lossy Parameter Compression","authors":"Zhaorui Zhang;Sheng Di;Kai Zhao;Sian Jin;Dingwen Tao;Zhuoran Ji;Benben Liu;Khalid Ayed Alharthi;Jiannong Cao;Franck Cappello","doi":"10.1109/TPDS.2025.3564736","DOIUrl":null,"url":null,"abstract":"Cross-Silo federated learning is widely used for scaling deep neural network (DNN) training over data silos from different locations worldwide while guaranteeing data privacy. Communication has been identified as the main bottleneck when training large-scale models due to large-volume model parameters and gradient transmission across public networks with limited bandwidth. Most previous works focus on gradient compression, while limited work tries to compress parameters that can not be ignored and extremely affect communication performance during the training. To bridge this gap, we propose <italic>FedCSpc:</i> an efficient cross-silo federated learning system with an XAI-driven adaptive parameter compression strategy for large-scale model training. Our work substantially differs from existing gradient compression techniques due to the distinct data features of gradient and parameter. The key contributions of this paper are fourfold. (1) Our designed <italic>FedCSpc</i> proposes to compress the parameter during the training using the state-of-the-art error-bounded lossy compressor – SZ3. (2) We develop an adaptive compression error bound adjustment algorithm to guarantee the model accuracy effectively. (3) We exploit an efficient approach to utilize the idle CPU resources of clients to compress the parameters. (4) We perform a comprehensive evaluation with a wide range of models and benchmarks on a GPU cluster with 65 GPUs. Results show that <italic>FedCSpc</i> can achieve the same model accuracy as FedAvg while reducing the data volume of parameters and gradients in communication by up to 7.39× and 288×, respectively. With 32 clients on a 4 Gb size model, <italic>FedCSpc</i> significantly outperforms FedAvg in wall-clock time in the emulated WAN environment (at the bandwidth of 1 Gbps or lower without loss of generality).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1372-1386"},"PeriodicalIF":5.6000,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10978107/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0
Abstract
Cross-silo federated learning is widely used to scale deep neural network (DNN) training over data silos in different locations worldwide while guaranteeing data privacy. Communication has been identified as the main bottleneck when training large-scale models, because large volumes of model parameters and gradients must be transmitted across public networks with limited bandwidth. Most previous works focus on gradient compression, while little work attempts to compress parameters, whose transmission cost cannot be ignored and strongly affects communication performance during training. To bridge this gap, we propose FedCSpc: an efficient cross-silo federated learning system with an XAI-driven adaptive parameter compression strategy for large-scale model training. Our work differs substantially from existing gradient compression techniques because gradients and parameters have distinct data features. The key contributions of this paper are fourfold. (1) FedCSpc compresses parameters during training using the state-of-the-art error-bounded lossy compressor SZ3. (2) We develop an adaptive compression error-bound adjustment algorithm that effectively guarantees model accuracy. (3) We design an efficient approach that exploits the idle CPU resources of clients to compress the parameters. (4) We perform a comprehensive evaluation with a wide range of models and benchmarks on a GPU cluster with 65 GPUs. Results show that FedCSpc achieves the same model accuracy as FedAvg while reducing the communicated data volume of parameters and gradients by up to 7.39× and 288×, respectively. With 32 clients and a 4 Gb model, FedCSpc significantly outperforms FedAvg in wall-clock time in an emulated WAN environment (at a bandwidth of 1 Gbps or lower, without loss of generality).
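The abstract describes compressing model parameters with a pointwise error-bounded lossy compressor (SZ3) before transmission and adaptively adjusting the error bound to protect accuracy. The sketch below illustrates only the general idea: a plain uniform scalar quantizer stands in for SZ3, and the adaptive rule (adapt_error_bound), its thresholds, and the accuracy-drop signal are hypothetical placeholders, not the paper's algorithm.

```python
# Minimal sketch (not the authors' implementation) of error-bounded lossy
# parameter compression in one cross-silo FL round. A uniform scalar
# quantizer stands in for SZ3; the error-bound adaptation is an assumed heuristic.
import numpy as np

def compress(params: np.ndarray, err_bound: float):
    """Quantize so that the pointwise reconstruction error is <= err_bound."""
    step = 2.0 * err_bound
    codes = np.round(params / step).astype(np.int32)  # integer codes (entropy-codable)
    return codes, step

def decompress(codes: np.ndarray, step: float) -> np.ndarray:
    return codes.astype(np.float32) * step

def adapt_error_bound(err_bound: float, acc_drop: float,
                      tol: float = 0.5, shrink: float = 0.5, grow: float = 1.2) -> float:
    """Hypothetical rule: tighten the bound if accuracy degrades, otherwise relax it."""
    return err_bound * (shrink if acc_drop > tol else grow)

# --- one emulated round with synthetic parameters ---
rng = np.random.default_rng(0)
global_params = rng.standard_normal(1_000_000).astype(np.float32)
err_bound = 1e-2

codes, step = compress(global_params, err_bound)   # server -> clients (compressed)
restored = decompress(codes, step)                 # client side
assert np.max(np.abs(restored - global_params)) <= err_bound + 1e-7

acc_drop = 0.1  # placeholder: accuracy loss vs. an uncompressed baseline
err_bound = adapt_error_bound(err_bound, acc_drop)
print(f"unique codes: {np.unique(codes).size}, next error bound: {err_bound:.4f}")
```

In the actual system, the quantized representation would additionally be entropy-coded by SZ3, and, as the abstract notes, compression work can be offloaded to otherwise idle client CPUs so it overlaps with GPU training.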
Journal Introduction:
IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:
a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing.
b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.
c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.
d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.