Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation

Ouyang Xue, P. Garraghan, D. McKee, P. Townend, Jie Xu
Published in: 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA)
Publication date: 2016-03-23
DOI: 10.1109/AINA.2016.84 (https://doi.org/10.1109/AINA.2016.84)
Citations: 31

Abstract

Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value for the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness, as it fails to consider the intrinsic diversity of job timing constraints within modern Cloud computing systems. Capturing such heterogeneity makes it possible to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm by simulating a number of different operational scenarios based on real production cluster data and comparing against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62% fewer replicas under high resource utilization while reducing response time by up to 17.86% for idle periods, compared to a static threshold.
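The contrast between a static and a dynamic threshold can be illustrated with a minimal Python sketch. This is not the algorithm from the paper, which the abstract does not specify. The function names, the progress scores in [0, 1], the `utilization` and `deadline_slack` inputs, and the scaling rule are all hypothetical, chosen only to show the qualitative behavior the abstract describes: stricter straggler detection when the cluster is idle or deadlines are tight, and more lenient detection (fewer replicas) when utilization is high.

```python
from statistics import mean


def static_stragglers(progress, threshold=0.2):
    """Flag tasks whose progress lags the job average by more than `threshold`.

    `progress` maps a task id to a progress score in [0, 1].
    """
    avg = mean(progress.values())
    return sorted(task for task, p in progress.items() if avg - p > threshold)


def dynamic_threshold(base, utilization, deadline_slack):
    """Hypothetical scaling rule (an assumption, not the paper's formula).

    `utilization` is cluster load in [0, 1]; `deadline_slack` is in [0, 1],
    where 0 means a tight QoS deadline and 1 means a loose one.
    High load or loose deadlines raise the threshold (fewer replicas);
    low load or tight deadlines lower it (more aggressive speculation).
    """
    return base * (0.5 + utilization) * (0.5 + deadline_slack)


# Illustrative progress scores for one job (made-up values).
progress = {"t1": 0.95, "t2": 0.90, "t3": 0.90, "t4": 0.65, "t5": 0.40}

fixed = static_stragglers(progress, threshold=0.2)
idle = static_stragglers(progress, dynamic_threshold(0.2, 0.1, 0.1))
busy = static_stragglers(progress, dynamic_threshold(0.2, 0.9, 0.9))
print(fixed, idle, busy)
```

With these made-up inputs, the idle, tight-deadline case flags more tasks than the fixed threshold, and the busy, loose-deadline case flags none, so no replicas would be launched while resources are scarce.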