Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources

IF 3.4 | Zone 3, Computer Science | JCR Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing Pub Date : 2025-05-14 DOI:10.1016/j.jpdc.2025.105103

Roopkatha Banerjee, Prince Modi, Jinal Vyas, Chunduru Sri Abhijit, Tejus Chandrashekar, Harsha Varun Marisetty, Manik Gupta, Yogesh Simmhan

Volume 203, Article 105103. https://www.sciencedirect.com/science/article/pii/S074373152500070X

Citations: 0

Abstract

With recent improvements in mobile and edge computing and rising concerns about data privacy, Federated Learning (FL) has rapidly gained popularity as a privacy-preserving, distributed machine learning methodology. Several FL frameworks have been built for testing novel FL strategies. However, most focus on validating the learning aspects of FL through pseudo-distributed simulation rather than on deploying to real edge hardware in a distributed manner, which is needed to meaningfully evaluate the federated aspects from a systems perspective. Current frameworks are also not inherently designed to support asynchronous aggregation, which is gaining popularity, and have limited resilience to client and server failures. We introduce Flotilla, a scalable and lightweight FL framework. It adopts a “user-first” modular design to help rapidly compose various synchronous and asynchronous FL strategies while being agnostic to the DNN architecture. It uses stateless clients and a server design that separates out the session state, which is periodically or incrementally checkpointed. We demonstrate the modularity of Flotilla by evaluating five different FL strategies for training five DNN models. We also evaluate the client and server-side fault tolerance on 200+ clients, and showcase its ability to fail over within seconds. Finally, we show that Flotilla's resource usage on Raspberry Pis and Nvidia Jetson edge accelerators is comparable to or better than that of three state-of-the-art FL frameworks: Flower, OpenFL and FedML. It also scales significantly better than Flower for 1000+ clients. This positions Flotilla as a competitive candidate to build novel FL strategies on, compare them uniformly, rapidly deploy them, and perform systems research and optimizations.
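The ideas named in the abstract (synchronous versus asynchronous aggregation, and session state kept separate from the server so it can be checkpointed) can be illustrated with a small sketch. This is not Flotilla's API: every name below (`SessionState`, `fedavg`, `async_merge`, `checkpoint`, `restore`) is hypothetical, the synchronous rule is a generic sample-weighted average, and the staleness discount is one common choice assumed here for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SessionState:
    """Hypothetical session state, held apart from server logic so it can be checkpointed."""
    round_no: int = 0
    weights: list = field(default_factory=lambda: [0.0, 0.0])


def fedavg(updates):
    """Synchronous aggregation: sample-weighted mean over (weights, num_samples) pairs."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


def async_merge(global_w, client_w, staleness, base_mix=0.5):
    """Asynchronous aggregation: blend in a single update, discounted by its staleness."""
    alpha = base_mix / (1 + staleness)  # assumed discount; older updates count less
    return [(1 - alpha) * g + alpha * c for g, c in zip(global_w, client_w)]


def checkpoint(state):
    """Serialize the session state; a real server would write this to durable storage."""
    return json.dumps(asdict(state))


def restore(blob):
    """Rebuild the session state, e.g. on a replacement server after a failure."""
    return SessionState(**json.loads(blob))


# One synchronous round, a checkpoint, a simulated failover, then one async update.
state = SessionState()
state.weights = fedavg([([1.0, 2.0], 10), ([3.0, 4.0], 30)])
state.round_no += 1
recovered = restore(checkpoint(state))
recovered.weights = async_merge(recovered.weights, [4.5, 5.5], staleness=1)
```

Because the clients in this sketch hold no session state, only `SessionState` has to survive a server failure, which is what makes restoring from the last checkpoint sufficient here.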


Source journal

Journal of Parallel and Distributed Computing (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 10.30

Self-citation rate: 2.60%

Articles published: 172

Review time: 12 months

About the journal: This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing. The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics, again covering the full range from the design to the use of our targeted systems.