Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

Yudong Chen, Lili Su, Jiaming Xu
{"title":"Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent","authors":"Yudong Chen, Lili Su, Jiaming Xu","doi":"10.1145/3219617.3219655","DOIUrl":null,"url":null,"abstract":"We consider the distributed statistical learning problem over decentralized systems that are prone to adversarial attacks. This setup arises in many practical applications, including Google's Federated Learning. Formally, we focus on a decentralized system that consists of a parameter server and m working machines; each working machine keeps N/m data samples, where N is the total number of samples. In each iteration, up to q of the m working machines suffer Byzantine faults -- a faulty machine in the given iteration behaves arbitrarily badly against the system and has complete knowledge of the system. Additionally, the sets of faulty machines may be different across iterations. Our goal is to design robust algorithms such that the system can learn the underlying true parameter, which is of dimension d, despite the interruption of the Byzantine attacks. In this paper, based on the geometric median of means of the gradients, we propose a simple variant of the classical gradient descent method. We show that our method can tolerate q Byzantine failures up to 2(1+ε)q łe m for an arbitrarily small but fixed constant ε>0. The parameter estimate converges in O(łog N) rounds with an estimation error on the order of max √dq/N, ~√d/N , which is larger than the minimax-optimal error rate √d/N in the centralized and failure-free setting by at most a factor of √q . The total computational complexity of our algorithm is of O((Nd/m) log N) at each working machine and O(md + kd log 3 N) at the central server, and the total communication cost is of O(m d log N). We further provide an application of our general results to the linear regression problem. A key challenge arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. To handle this issue in the analysis, we prove that the aggregated gradient, as a function of model parameter, converges uniformly to the true gradient function.","PeriodicalId":210440,"journal":{"name":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"164","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3219617.3219655","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 164

Abstract

We consider the distributed statistical learning problem over decentralized systems that are prone to adversarial attacks. This setup arises in many practical applications, including Google's Federated Learning. Formally, we focus on a decentralized system that consists of a parameter server and m working machines; each working machine keeps N/m data samples, where N is the total number of samples. In each iteration, up to q of the m working machines suffer Byzantine faults -- a faulty machine in the given iteration behaves arbitrarily badly against the system and has complete knowledge of the system. Additionally, the sets of faulty machines may differ across iterations. Our goal is to design robust algorithms such that the system can learn the underlying true parameter, which is of dimension d, despite the disruption caused by the Byzantine attacks. In this paper, based on the geometric median of means of the gradients, we propose a simple variant of the classical gradient descent method. We show that our method can tolerate up to q Byzantine failures whenever 2(1+ε)q ≤ m, for an arbitrarily small but fixed constant ε > 0. The parameter estimate converges in O(log N) rounds with an estimation error on the order of max{√(dq/N), √(d/N)}, which is larger than the minimax-optimal error rate √(d/N) of the centralized, failure-free setting by at most a factor of √q. The total computational complexity of our algorithm is O((Nd/m) log N) at each working machine and O(md + kd log³ N) at the central server, and the total communication cost is O(md log N). We further provide an application of our general results to the linear regression problem. A key challenge in the above problem is that Byzantine failures create arbitrary and unspecified dependencies among the iterations and the aggregated gradients. To handle this issue in the analysis, we prove that the aggregated gradient, as a function of the model parameter, converges uniformly to the true gradient function.
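To make the aggregation rule concrete, the following is a minimal sketch in Python/NumPy of the geometric-median-of-means gradient descent described above. It is an illustration under simplifying assumptions, not the paper's reference implementation: the function and parameter names (grad_fn, step_size, num_rounds) are placeholders, the geometric median is approximated here with Weiszfeld's iteration, and the fixed step size and iteration counts are arbitrary choices.

import numpy as np

def geometric_median(points, num_iters=100, tol=1e-7):
    # Approximate the geometric median of the rows of `points`
    # using Weiszfeld's iteration, starting from the coordinate-wise mean.
    y = points.mean(axis=0)
    for _ in range(num_iters):
        dists = np.linalg.norm(points - y, axis=1)
        dists = np.maximum(dists, 1e-12)  # avoid division by zero at a data point
        weights = 1.0 / dists
        y_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def median_of_means_gradient(machine_grads, k):
    # machine_grads: (m, d) array of gradients reported by the m machines.
    # Group the machines into k batches, average each batch, then take the
    # geometric median of the k batch means.
    batches = np.array_split(machine_grads, k, axis=0)
    batch_means = np.vstack([b.mean(axis=0) for b in batches])
    return geometric_median(batch_means)

def byzantine_gradient_descent(grad_fn, theta0, k, num_rounds=50, step_size=0.1):
    # grad_fn(theta) -> (m, d) array of per-machine gradients at theta
    # (up to q of which may be arbitrary in each round); theta0: initial parameter.
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(num_rounds):
        grads = grad_fn(theta)
        theta = theta - step_size * median_of_means_gradient(grads, k)
    return theta

Roughly speaking, the number of batches is chosen as k = 2(1+ε)q so that the Byzantine machines can contaminate fewer than half of the batch means, leaving the geometric median a reliable proxy for the true gradient.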