Towards Self-Managing Cloud Storage with Reinforcement Learning

Ridwan Rashid Noel, Rohit Mehra, P. Lama
Published in: 2019 IEEE International Conference on Cloud Engineering (IC2E), 2019-06-24
DOI: 10.1109/IC2E.2019.000-9 (https://doi.org/10.1109/IC2E.2019.000-9)
Citations: 12

Abstract

Cloud storage services often suffer from performance problems caused by load imbalance, interference from background tasks such as data scrubbing, backfilling, and recovery, and differences in the processing capabilities of heterogeneous servers in a datacenter. These problems significantly affect a broad range of applications characterized by massive working sets and real-time constraints. However, it is challenging and burdensome for human operators to hand-tune the various control knobs of a cloud-scale storage cluster to maintain optimal performance under diverse workload conditions. Our study of Ceph, an open-source object-based storage system, shows that common load-balancing strategies are ineffective unless they are adapted to workload characteristics. Furthermore, the positive effects of an applied strategy may not be immediately visible. To address these challenges, we developed a machine-learning-based system adaptation technique that enables a cloud storage system to manage itself through load balancing and data migration, with the aim of delivering optimal performance in the face of diverse workload patterns and resource bottlenecks. In particular, we applied a stochastic policy-gradient-based reinforcement learning technique to track performance hotspots in the storage cluster and take appropriate corrective actions that maximize future performance under a variety of complex scenarios. For this purpose, we leveraged system-level performance monitoring and the control knobs commonly available in object-based cloud storage systems. We implemented these techniques in an Adaptive Resource Management (ARM) system for object-based storage clusters and evaluated it on NSF Cloud's Chameleon testbed. Experiments using the Cloud Object Storage Benchmark (COSBench) show that ARM improves the average read and write response times of a Ceph storage cluster by up to 50% and 33%, respectively, compared to the default case. It also outperforms a state-of-the-art dynamic load rebalancing technique in the read and write performance of Ceph storage by 43% and 36%, respectively.
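
The stochastic policy gradient idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the ARM implementation, whose state, action, and reward definitions the abstract does not give. It is a toy REINFORCE-style learner under stated assumptions: the state is the index of the currently hottest storage node, the action is the node to shift load away from, and the reward is a made-up noisy signal that is positive only when load is shifted away from the hot node. All names (`toy_reward`, `train`, `num_nodes`) are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(hot_node, action, rng):
    """Hypothetical reward: +1 if load is shifted away from the hot node,
    -1 otherwise, plus measurement noise. A real system would derive this
    from observed response times instead."""
    base = 1.0 if action == hot_node else -1.0
    return base + rng.gauss(0.0, 0.1)


def train(num_nodes=3, episodes=3000, lr=0.1, seed=0):
    """REINFORCE with a running-average baseline on one-step episodes."""
    rng = random.Random(seed)
    # theta[state][action]: preference for each action in each state
    theta = [[0.0] * num_nodes for _ in range(num_nodes)]
    baseline = 0.0
    for _ in range(episodes):
        hot = rng.randrange(num_nodes)           # observe the state
        probs = softmax(theta[hot])              # stochastic policy
        action = rng.choices(range(num_nodes), weights=probs)[0]
        reward = toy_reward(hot, action, rng)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)   # update running baseline
        # Gradient of log softmax: 1[k == action] - probs[k]
        for k in range(num_nodes):
            grad = (1.0 if k == action else 0.0) - probs[k]
            theta[hot][k] += lr * advantage * grad
    return theta


theta = train()
policy = [softmax(row) for row in theta]
for state, probs in enumerate(policy):
    print(f"hot node {state}: action probabilities "
          f"{[round(p, 3) for p in probs]}")
```

In this toy setting the learned policy concentrates its probability on relieving whichever node is hot. The sketch uses one-step episodes for brevity. The delayed effects noted in the abstract would instead call for multi-step episodes in which each action is credited with the return accumulated after it.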