Understanding Synchronization Costs for Distributed ML on Transient Cloud Resources

2019 IEEE International Conference on Cloud Engineering (IC2E). Pub Date: 2019-06-01. DOI: 10.1109/IC2E.2019.00029

Pradeep Ambati, David E. Irwin, P. Shenoy, Lixin Gao, A. Ali-Eldin, Jeannie R. Albrecht

Citations: 7

Abstract

Cloud platforms often execute parallel batch applications, such as distributed machine learning (ML), that include numerous synchronization barriers. These barriers, which prevent any task from advancing beyond a specified point until all tasks have reached that point, significantly degrade application performance by reducing it to that of the slowest "straggler" task. To address the problem, researchers have proposed numerous straggler mitigation techniques, including speculatively re-executing straggler tasks and various relaxations of strict barrier semantics. While these techniques improve parallel application performance, they incur a cost in terms of the resources wasted re-executing tasks or waiting. Importantly, these costs, which are often implicit in prior work that targets dedicated resources, become explicit in the cloud, which charges for resources at fine-grained intervals. In addition, the cost difference between techniques is exacerbated in cloud platforms, since they charge substantially less for transient resources that effectively yield a probabilistic performance across a wide range. While transient resources' low list price is attractive, revocations increase the frequency and severity of stragglers, which decreases parallel job performance and increases overall execution cost. To better understand the cost of synchronization, we develop simple analytical models of different straggler mitigation techniques and compare their cost and performance on on-demand and transient resources. 
Our analysis shows that i) transient servers offer complex tradeoffs compared to on-demand servers, and can result in higher overall costs despite their highly discounted price due to their probabilistic performance; ii) common approaches to straggler mitigation, which is a well-studied problem, are less effective using transient servers that cause frequent and severe stragglers; and iii) a recent approach to flexible synchronization offers the best cost and performance.
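
The barrier effect described in the abstract can be illustrated with a small sketch. This is not the paper's analytical model: the function names, the per-task straggler probability, the fixed straggler delay, and the prices are all hypothetical values chosen for illustration. It assumes that every worker is billed for the full job runtime and that each iteration ends only when its slowest task finishes.

```python
import random


def barrier_runtime(task_times_per_iter):
    """Runtime under strict barriers: each iteration lasts as long as its slowest task."""
    return sum(max(times) for times in task_times_per_iter)


def idle_worker_time(task_times_per_iter):
    """Billed worker-time spent waiting at barriers for the slowest task."""
    return sum(max(times) - t for times in task_times_per_iter for t in times)


def job_cost(runtime, n_workers, price):
    """Cost when every worker is billed for the whole runtime (price per worker per time unit)."""
    return runtime * n_workers * price


def simulate(n_workers, n_iters, base_time, straggler_prob, straggler_delay, rng):
    """Draw per-task times; each task independently straggles with probability straggler_prob."""
    return [
        [base_time + (straggler_delay if rng.random() < straggler_prob else 0.0)
         for _ in range(n_workers)]
        for _ in range(n_iters)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical settings: (price, straggler probability, straggler delay).
    settings = {"on-demand": (1.00, 0.01, 1.0), "transient": (0.30, 0.10, 10.0)}
    for name, (price, prob, delay) in settings.items():
        times = simulate(8, 1000, 1.0, prob, delay, rng)
        runtime = barrier_runtime(times)
        print(name, round(runtime, 1), round(job_cost(runtime, 8, price), 1),
              round(idle_worker_time(times), 1))
```

Under these assumptions, an iteration with n workers contains at least one straggler with probability 1 - (1 - p)^n, so even a modest per-task straggler probability p lengthens most iterations once n is large. Varying the hypothetical price and delay values shows when a discounted price does or does not offset the longer runtime.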
