A robust sampling technique for realistic distribution simulation in federated learning.

IF 2.3 | Zone 3, Medicine | JCR Q3, ENGINEERING, BIOMEDICAL

International Journal of Computer Assisted Radiology and Surgery | Pub Date: 2025-09-02 | DOI: 10.1007/s11548-025-03504-z

Robin Hoepp, Leonhard Rist, Alexander Katzmann, Raghavan Ashok, Andreas Wimmer, Michael Sühling, Andreas Maier


Citations: 0

Abstract

Purpose: Federated Learning helps train deep learning networks on diverse data from different locations, particularly in restricted clinical settings. However, label distributions that overlap only partially across clients, due to different demographics, may significantly harm global training and thus local model performance. Investigating such effects before rolling out large-scale Federated Learning setups requires proper sampling of the expected label distributions.

Methods: We present a sampling algorithm that builds data subsets with desired means and standard deviations from an initial global distribution. To this end, we incorporate the chi-squared and Gini impurity measures to numerically and efficiently optimize label distributions for multiple groups.
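As an illustration of the idea only, the following minimal sketch draws one subset whose label histogram approximates a Gaussian with a desired mean and standard deviation, then scores the result with the Pearson chi-squared statistic and the Gini impurity. It is not the authors' algorithm: the function names, the Gaussian target, the per-bin quota scheme, and the use of both measures purely as diagnostics are assumptions made for this example.

```python
import numpy as np


def gini_impurity(counts):
    """Gini impurity 1 - sum(p_i^2) of a histogram given as counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)


def chi_squared(observed, expected):
    """Pearson chi-squared statistic between two count vectors."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    mask = expected > 0
    return np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask])


def sample_subset(labels, mean, std, size, n_bins=20, seed=0):
    """Pick indices whose labels roughly follow N(mean, std).

    Hypothetical helper: bins the global labels, derives a per-bin quota
    from a Gaussian target, and samples each bin without replacement.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)

    # Histogram of the global pool: bin index in 0..n_bins-1 per label.
    edges = np.linspace(labels.min(), labels.max(), n_bins + 1)
    bins = np.digitize(labels, edges[1:-1])
    available = np.bincount(bins, minlength=n_bins)

    # Target bin probabilities from the desired Gaussian (assumption).
    centers = 0.5 * (edges[:-1] + edges[1:])
    target = np.exp(-0.5 * ((centers - mean) / std) ** 2)
    target /= target.sum()

    # A bin cannot contribute more samples than the pool contains.
    quota = np.minimum(np.floor(target * size).astype(int), available)

    chosen = [
        rng.choice(np.flatnonzero(bins == b), size=int(quota[b]), replace=False)
        for b in range(n_bins)
        if quota[b] > 0
    ]
    chosen = np.concatenate(chosen) if chosen else np.empty(0, dtype=int)

    # Diagnostics: mismatch to the target and spread of the drawn histogram.
    chi2 = chi_squared(quota, target * quota.sum())
    return chosen, chi2, gini_impurity(quota)
```

Calling `sample_subset(labels, mean=60.0, std=8.0, size=500)` on a pool of global labels returns the selected indices together with the two diagnostics. Repeating the call with different targets gives one subset per simulated client, although a real multi-group setup would also have to keep the groups disjoint.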

Results: Using a real-world application scenario, we sample train and test groups according to region-specific distributions for 3D camera-based weight and height estimation in a clinical context, and compare our proposed sampling technique against a hard data split that serves as a baseline. We train a baseline model on all data for comparison and use Federated Averaging to combine the training on our data subsets, demonstrating a realistic deterioration of 25.3% on weight and 28.7% on height estimation by the global model.
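For readers unfamiliar with Federated Averaging, the aggregation step can be sketched as a sample-count-weighted mean of the clients' parameters. The function name and the list-of-arrays layout below are assumptions for illustration. The abstract does not describe the authors' training setup, so local training and communication rounds are omitted.

```python
import numpy as np


def federated_average(client_weights, client_sizes):
    """Average per-client parameter lists, weighted by local sample counts.

    client_weights: one list of NumPy arrays (one array per layer) per client.
    client_sizes:   number of local training samples per client.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]


# Two hypothetical clients with a single-layer model each.
client_a = [np.array([1.0, 2.0])]
client_b = [np.array([3.0, 4.0])]
global_weights = federated_average([client_a, client_b], client_sizes=[1, 3])
```

Because the second client holds three times as many samples, its parameters dominate the average. This weighting is also why skewed client label distributions can pull the global model toward particular subpopulations.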

Conclusions: Realistic client-biased label distributions can notably harm training in a federated context. Our sampling algorithm for simulating realistic data distributions opens up an efficient way to analyze this effect in advance. The technique is agnostic to the chosen network architecture and target scenario and can be adapted to any feature or label problem with non-IID subpopulations.


Source journal

International Journal of Computer Assisted Radiology and Surgery ENGINEERING, BIOMEDICAL-RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

CiteScore

5.90

Self-citation rate

6.70%

Articles published

243

Review time

6-12 weeks

About the journal: The International Journal for Computer Assisted Radiology and Surgery (IJCARS) is a peer-reviewed journal that provides a platform for closing the gap between medical and technical disciplines, and encourages interdisciplinary research and development activities in an international environment.