Distributed Online Service Coordination Using Deep Reinforcement Learning

Stefan Schneider, Haydar Qarawlus, Holger Karl
{"title":"Distributed Online Service Coordination Using Deep Reinforcement Learning","authors":"Stefan Schneider, Haydar Qarawlus, Holger Karl","doi":"10.1109/ICDCS51616.2021.00058","DOIUrl":null,"url":null,"abstract":"Services often consist of multiple chained components such as microservices in a service mesh, or machine learning functions in a pipeline. Providing these services requires online coordination including scaling the service, placing instance of all components in the network, scheduling traffic to these instances, and routing traffic through the network. Optimized service coordination is still a hard problem due to many influencing factors such as rapidly arriving user demands and limited node and link capacity. Existing approaches to solve the problem are often built on rigid models and assumptions, tailored to specific scenarios. If the scenario changes and the assumptions no longer hold, they easily break and require manual adjustments by experts. Novel self-learning approaches using deep reinforcement learning (DRL) are promising but still have limitations as they only address simplified versions of the problem and are typically centralized and thus do not scale to practical large-scale networks. To address these issues, we propose a distributed self-learning service coordination approach using DRL. After centralized training, we deploy a distributed DRL agent at each node in the network, making fast coordination decisions locally in parallel with the other nodes. Each agent only observes its direct neighbors and does not need global knowledge. Hence, our approach scales independently from the size of the network. In our extensive evaluation using real-world network topologies and traffic traces, we show that our proposed approach outperforms a state-of-the-art conventional heuristic as well as a centralized DRL approach (60 % higher throughput on average) while requiring less time per online decision (1 ms).","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS51616.2021.00058","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Services often consist of multiple chained components, such as microservices in a service mesh or machine learning functions in a pipeline. Providing these services requires online coordination, including scaling the service, placing instances of all components in the network, scheduling traffic to these instances, and routing traffic through the network. Optimized service coordination is still a hard problem due to many influencing factors such as rapidly arriving user demands and limited node and link capacity. Existing approaches to solve the problem are often built on rigid models and assumptions, tailored to specific scenarios. If the scenario changes and the assumptions no longer hold, they easily break and require manual adjustments by experts. Novel self-learning approaches using deep reinforcement learning (DRL) are promising but still have limitations, as they only address simplified versions of the problem and are typically centralized and thus do not scale to practical large-scale networks. To address these issues, we propose a distributed self-learning service coordination approach using DRL. After centralized training, we deploy a distributed DRL agent at each node in the network, making fast coordination decisions locally in parallel with the other nodes. Each agent only observes its direct neighbors and does not need global knowledge. Hence, our approach scales independently of the size of the network. In our extensive evaluation using real-world network topologies and traffic traces, we show that our proposed approach outperforms a state-of-the-art conventional heuristic as well as a centralized DRL approach (60% higher throughput on average) while requiring less time per online decision (1 ms).
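To illustrate the core idea of a per-node agent that acts only on local observations, the following minimal Python sketch shows what the deployed inference step at a single node could look like. All class names, observation layouts, and network shapes here are illustrative assumptions; the abstract does not specify the agents' architecture or training algorithm. The policy weights would come from the centralized training phase described above; only the local forward pass at decision time is sketched.

```python
import numpy as np

class LocalAgent:
    """Hypothetical per-node agent: maps a fixed-size local observation
    (own load plus load and link utilization of each direct neighbor)
    to a traffic-split action (process locally vs. forward to a neighbor).
    No global knowledge is required, so the observation and action sizes
    depend only on the node degree, not on the network size."""

    def __init__(self, num_neighbors, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        obs_dim = 1 + 2 * num_neighbors   # own load + (load, link util) per neighbor
        act_dim = 1 + num_neighbors       # local share + share per neighbor
        # Tiny two-layer policy; in practice these weights would be the
        # result of the (centralized) DRL training, not random values.
        self.w1 = rng.normal(0.0, 0.1, (obs_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, act_dim))

    def act(self, own_load, neighbor_loads, link_utils):
        obs = np.concatenate(([own_load], neighbor_loads, link_utils))
        h = np.tanh(obs @ self.w1)
        logits = h @ self.w2
        # Softmax turns the logits into a traffic-split distribution.
        e = np.exp(logits - logits.max())
        return e / e.sum()

# Example: a node with two neighbors decides how to split incoming traffic.
agent = LocalAgent(num_neighbors=2)
split = agent.act(own_load=0.7, neighbor_loads=[0.2, 0.9], link_utils=[0.1, 0.5])
print(split)  # [p_local, p_neighbor1, p_neighbor2], sums to 1
```

Because each agent's input covers only its direct neighborhood, such a forward pass is cheap (consistent with the reported ~1 ms per online decision) and all nodes can decide in parallel.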