大规模分布式系统的鲁棒调度

2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) Pub Date : 2020-12-01 DOI:10.1109/TrustCom50675.2020.00019

Young Choon Lee, Jayden King, Young Ki Kim, Seok-Hee Hong

{"title":"大规模分布式系统的鲁棒调度","authors":"Young Choon Lee, Jayden King, Young Ki Kim, Seok-Hee Hong","doi":"10.1109/TrustCom50675.2020.00019","DOIUrl":null,"url":null,"abstract":"In large-scale distributed systems, such as clouds, failures are rather the norm than the exception. These failures include job failures, server failures, network outage and power failure. Among them, server failures are most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than that in traditional computer clusters as jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize such impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, considering the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether the reliability is measured from job's perspective or server's perspective. All four algorithms are designed essentially by combining these availability and reliability check methods. We evaluate our scheduling algorithms with failures generated based on our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.","PeriodicalId":221956,"journal":{"name":"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Robust Scheduling for Large-Scale Distributed Systems\",\"authors\":\"Young Choon Lee, Jayden King, Young Ki Kim, Seok-Hee Hong\",\"doi\":\"10.1109/TrustCom50675.2020.00019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In large-scale distributed systems, such as clouds, failures are rather the norm than the exception. These failures include job failures, server failures, network outage and power failure. Among them, server failures are most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than that in traditional computer clusters as jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize such impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, considering the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether the reliability is measured from job's perspective or server's perspective. All four algorithms are designed essentially by combining these availability and reliability check methods. We evaluate our scheduling algorithms with failures generated based on our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.\",\"PeriodicalId\":221956,\"journal\":{\"name\":\"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TrustCom50675.2020.00019\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TrustCom50675.2020.00019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在像云这样的大规模分布式系统中，故障是常态，而不是例外。这些故障包括作业故障、服务器故障、网络中断和电源故障。其中，服务器故障最为常见。随着云计算的广泛采用，云中的服务器故障的影响远远大于传统计算机集群中的服务器故障，因为不同租户的作业通常位于同一位置(多租户)。在本文中，我们通过实际的故障建模来解决鲁棒调度问题，以最大限度地减少对(共定位)作业执行的影响。为此，考虑到服务器的可用性和可靠性，我们开发了四种在线故障感知(FA)调度算法:FAFF-WJ、FAFF-FC、FABF-WJ和FABF-FC。特别是，FF (First-Fit)和BF (Best-Fit)表明如何检查服务器的可用性，而WJ (Waiting Job)和FC (Failure Count)的区别主要在于可靠性是从作业的角度还是从服务器的角度测量的。这四种算法基本上都是通过结合这些可用性和可靠性检查方法来设计的。我们用基于六个真实服务器故障轨迹的故障建模生成的故障来评估我们的调度算法。我们的评估结果显示了我们的调度算法在鲁棒作业执行方面的有效性，无论是性能还是成本。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Robust Scheduling for Large-Scale Distributed Systems

In large-scale distributed systems, such as clouds, failures are rather the norm than the exception. These failures include job failures, server failures, network outage and power failure. Among them, server failures are most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than that in traditional computer clusters as jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize such impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, considering the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether the reliability is measured from job's perspective or server's perspective. All four algorithms are designed essentially by combining these availability and reliability check methods. We evaluate our scheduling algorithms with failures generated based on our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)

自引率

0.00%

发文量