一种用于优化短暂云资源利用的强化学习策略

2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom) Pub Date : 2020-09-23 DOI:10.1109/CloudCom49646.2020.00009

Mohamed Handaoui, Jean-Emile Dartois, Jalil Boukhobza, Olivier Barais, Laurent d'Orazio

{"title":"一种用于优化短暂云资源利用的强化学习策略","authors":"Mohamed Handaoui, Jean-Emile Dartois, Jalil Boukhobza, Olivier Barais, Laurent d'Orazio","doi":"10.1109/CloudCom49646.2020.00009","DOIUrl":null,"url":null,"abstract":"Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures which leads to low resources' utilization. One way to improve resource utilization and thus reduce the total cost of ownership is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources needs to meet the expectations of its customers in terms of Quality of Service. The goal is so to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers have to estimate their future utilization to provide availability guarantees. The prediction should consider a safety margin for resources to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions consider a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a unique fixed margin does not consider various workloads variations over time which may lead to SLA violations or/and poor utilization. In order to tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. ReLeaSER dynamically tunes the safety margin at the host-level for each resource metric. The strategy learns from past prediction errors (that caused SLA violations). Our solution reduces significantly the SLA violation penalties on average by $\\mathbf{2.7}\\times$ and up to $\\mathbf{3.4}\\times$. It also improves considerably the CPs' potential savings by 27.6% on average and up to 43.6%.","PeriodicalId":401135,"journal":{"name":"2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources\",\"authors\":\"Mohamed Handaoui, Jean-Emile Dartois, Jalil Boukhobza, Olivier Barais, Laurent d'Orazio\",\"doi\":\"10.1109/CloudCom49646.2020.00009\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures which leads to low resources' utilization. One way to improve resource utilization and thus reduce the total cost of ownership is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources needs to meet the expectations of its customers in terms of Quality of Service. The goal is so to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers have to estimate their future utilization to provide availability guarantees. The prediction should consider a safety margin for resources to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions consider a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a unique fixed margin does not consider various workloads variations over time which may lead to SLA violations or/and poor utilization. In order to tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. ReLeaSER dynamically tunes the safety margin at the host-level for each resource metric. The strategy learns from past prediction errors (that caused SLA violations). Our solution reduces significantly the SLA violation penalties on average by $\\\\mathbf{2.7}\\\\times$ and up to $\\\\mathbf{3.4}\\\\times$. It also improves considerably the CPs' potential savings by 27.6% on average and up to 43.6%.\",\"PeriodicalId\":401135,\"journal\":{\"name\":\"2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CloudCom49646.2020.00009\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CloudCom49646.2020.00009","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

云数据中心的容量被过度配置，以应对需求高峰和硬件故障，从而导致资源利用率低。提高资源利用率从而降低总拥有成本的一种方法是以较低的价格提供未使用的资源(称为临时资源)。然而，转售资源需要满足客户在服务质量方面的期望。我们的目标是最大化回收的资源量，同时避免SLA惩罚。为了实现这一点，云提供商必须估计其未来的利用率，以提供可用性保证。预测应该考虑资源对不可预测的工作负载作出反应的安全范围。挑战在于找到在回收的资源量和违反SLA的风险之间提供最佳折衷的安全范围。大多数最先进的解决方案考虑所有类型指标(例如，CPU, RAM)的固定安全裕度。但是，唯一的固定余量没有考虑到随着时间的推移各种工作负载的变化，这可能导致SLA违反或/和利用率低下。为了应对这些挑战，我们提出了ReLeaSER，一种用于优化云中短暂资源利用的强化学习策略。release动态地调整每个资源度量在主机级别的安全裕度。该策略从过去的预测错误(导致SLA违规)中学习。我们的解决方案显著地减少了SLA违规处罚，平均减少了$\mathbf{2.7}\times$，最多减少了$\mathbf{3.4}\times$。它还大大提高了CPs的潜在节省，平均节省27.6%，最高可达43.6%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources

Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures which leads to low resources' utilization. One way to improve resource utilization and thus reduce the total cost of ownership is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources needs to meet the expectations of its customers in terms of Quality of Service. The goal is so to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers have to estimate their future utilization to provide availability guarantees. The prediction should consider a safety margin for resources to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions consider a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a unique fixed margin does not consider various workloads variations over time which may lead to SLA violations or/and poor utilization. In order to tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. ReLeaSER dynamically tunes the safety margin at the host-level for each resource metric. The strategy learns from past prediction errors (that caused SLA violations). Our solution reduces significantly the SLA violation penalties on average by $\mathbf{2.7}\times$ and up to $\mathbf{3.4}\times$. It also improves considerably the CPs' potential savings by 27.6% on average and up to 43.6%.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)

自引率

0.00%

发文量