Robust and Probabilistic Failure-Aware Placement

M. Korupolu, R. Rajaraman
{"title":"Robust and Probabilistic Failure-Aware Placement","authors":"M. Korupolu, R. Rajaraman","doi":"10.1145/2935764.2935802","DOIUrl":null,"url":null,"abstract":"Motivated by the growing complexity and heterogeneity of modern data centers, and the prevalence of commodity component failures, this paper studies the failure-aware placement problem of placing tasks of a parallel job on machines in the data center with the goal of increasing availability. We consider two models of failures: adversarial and probabilistic. In the adversarial model, each node has a weight (higher weight implying higher reliability) and the adversary can remove any subset of nodes of total weight at most a given bound W and our goal is to find a placement that incurs the least disruption against such an adversary. In the probabilistic model, each node has a probability of failure and we need to find a placement that maximizes the probability that at least K out of N tasks survive at any time. For adversarial failures, we first show that (i) the problems are in Σ2, the second level of the polynomial hierarchy, (ii) a basic variant, that we call RobustFAP, is co-NP-hard, and (iii) an all-or-nothing version of RobustFAP is Σ2-complete. We then give a PTAS for RobustFAP, a key ingredient of which is a solution that we design for a fractional version of RobustFAP. We then study fractional RobustFAP over hierarchies, denoted HierRobustFAP, and introduce a notion of hierarchical max-min fairness/ and a novel Generalized Spreading/ algorithm which is simultaneously optimal for all W. These generalize the classical notion of max-min fairness to work with nodes of differing capacities, differing reliability weights and hierarchical structures. Using randomized rounding, we extend this to give an algorithm for integral HierRobustFAP. For the probabilistic version, we first give an algorithm that achieves an additive ε approximation in the failure probability for the single level version, called ProbFAP, while giving up a (1 + ε) multiplicative factor in the number of failures. We then extend the result to the hierarchical version, HierProbFAP, achieving an ε additive approximation in failure probability while giving up an (L + ε) multiplicative factor in the number of failures, where $L$ is the number of levels in the hierarchy.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2935764.2935802","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Motivated by the growing complexity and heterogeneity of modern data centers, and the prevalence of commodity component failures, this paper studies the failure-aware placement problem of placing tasks of a parallel job on machines in the data center with the goal of increasing availability. We consider two models of failures: adversarial and probabilistic. In the adversarial model, each node has a weight (higher weight implying higher reliability) and the adversary can remove any subset of nodes of total weight at most a given bound W and our goal is to find a placement that incurs the least disruption against such an adversary. In the probabilistic model, each node has a probability of failure and we need to find a placement that maximizes the probability that at least K out of N tasks survive at any time. For adversarial failures, we first show that (i) the problems are in Σ2, the second level of the polynomial hierarchy, (ii) a basic variant, that we call RobustFAP, is co-NP-hard, and (iii) an all-or-nothing version of RobustFAP is Σ2-complete. We then give a PTAS for RobustFAP, a key ingredient of which is a solution that we design for a fractional version of RobustFAP. We then study fractional RobustFAP over hierarchies, denoted HierRobustFAP, and introduce a notion of hierarchical max-min fairness/ and a novel Generalized Spreading/ algorithm which is simultaneously optimal for all W. These generalize the classical notion of max-min fairness to work with nodes of differing capacities, differing reliability weights and hierarchical structures. Using randomized rounding, we extend this to give an algorithm for integral HierRobustFAP. For the probabilistic version, we first give an algorithm that achieves an additive ε approximation in the failure probability for the single level version, called ProbFAP, while giving up a (1 + ε) multiplicative factor in the number of failures. We then extend the result to the hierarchical version, HierProbFAP, achieving an ε additive approximation in failure probability while giving up an (L + ε) multiplicative factor in the number of failures, where $L$ is the number of levels in the hierarchy.
鲁棒和概率故障感知放置
由于现代数据中心日益复杂和异构,以及商品组件故障的普遍存在,本文以提高可用性为目标,研究了在数据中心的机器上放置并行作业任务的故障感知放置问题。我们考虑两种失败模型:对抗性和概率性。在对抗模型中,每个节点都有一个权重(更高的权重意味着更高的可靠性),对手可以在给定的范围内最多移除总权重的任何节点子集,我们的目标是找到一个对这样的对手造成最小干扰的位置。在概率模型中,每个节点都有一个失败的概率,我们需要找到一个位置,使N个任务中至少有K个任务在任何时候存活的概率最大化。对于对抗性失败,我们首先表明(i)问题在Σ2,多项式层次结构的第二层,(ii)一个基本变体,我们称之为RobustFAP,是共同np -hard的,(iii) RobustFAP的全有或全无版本是Σ2-complete。然后,我们给出了RobustFAP的PTAS,其中一个关键成分是我们为RobustFAP的分数版本设计的解决方案。然后,我们研究了层次上的分数阶RobustFAP,表示为HierRobustFAP,并引入了分层最大最小公平性的概念和一种同时对所有w最优的新型广义传播算法。这些算法将经典的最大最小公平性概念推广到不同容量、不同可靠性权重和层次结构的节点上。采用随机四舍五入的方法,对其进行了扩展,给出了一种积分HierRobustFAP算法。对于概率版本,我们首先给出了一种算法,该算法实现了单级版本的故障概率的可加性ε近似,称为ProbFAP,同时放弃了故障数量的(1 + ε)乘因子。然后我们将结果扩展到层次版本,HierProbFAP,在失败概率中获得ε加性近似,同时放弃失败数量中的(L + ε)乘法因子,其中$L$是层次中的级别数。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信