服务网格断路器:从紧急按钮到性能管理工具

Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems Pub Date : 2021-04-26 DOI:10.1145/3447851.3458740

Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson

{"title":"服务网格断路器:从紧急按钮到性能管理工具","authors":"Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson","doi":"10.1145/3447851.3458740","DOIUrl":null,"url":null,"abstract":"Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.","PeriodicalId":166666,"journal":{"name":"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Service mesh circuit breaker: From panic button to performance management tool\",\"authors\":\"Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson\",\"doi\":\"10.1145/3447851.3458740\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.\",\"PeriodicalId\":166666,\"journal\":{\"name\":\"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3447851.3458740\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447851.3458740","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 11

摘要

站点可靠性工程师处于两种紧张关系的中心:一方面，他们需要在短时间内响应警报，以恢复非功能系统。另一方面，短的反应时间会扰乱日常生活，导致警觉疲劳。为了缓解这种紧张关系，提出了许多资源管理机制来处理过载和减少故障。最近的一种机制是在服务网格中断路。断路拒绝传入请求，以牺牲可用性(成功应答的请求)为代价来保护延迟，但在许多情况下，由于在高度动态的微服务环境中很难知道何时触发断路，因此两者都无法实现。我们提出了一种自适应断路机制，通过自适应控制器实现，不仅可以避免过载和减轻故障，而且可以在最大限度地提高服务吞吐量的同时保持尾部响应时间低于给定阈值。在基于Istio和Kubernetes的测试平台上，我们提出的控制器与静态断路器在各种过载场景下进行了实验比较。结果表明，我们的控制器在98%的时间内(包括冷启动)平均保持尾部响应时间低于给定阈值，可用性为70%，29%的请求电路中断。这优于静态断路器配置，后者具有63%的可用性、30%的电路中断请求和超过5%的请求超时。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Service mesh circuit breaker: From panic button to performance management tool

Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems

自引率

0.00%

发文量