服务网格断路器:从紧急按钮到性能管理工具

Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson
{"title":"服务网格断路器:从紧急按钮到性能管理工具","authors":"Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson","doi":"10.1145/3447851.3458740","DOIUrl":null,"url":null,"abstract":"Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.","PeriodicalId":166666,"journal":{"name":"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Service mesh circuit breaker: From panic button to performance management tool\",\"authors\":\"Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson\",\"doi\":\"10.1145/3447851.3458740\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.\",\"PeriodicalId\":166666,\"journal\":{\"name\":\"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3447851.3458740\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447851.3458740","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

摘要

站点可靠性工程师处于两种紧张关系的中心:一方面,他们需要在短时间内响应警报,以恢复非功能系统。另一方面,短的反应时间会扰乱日常生活,导致警觉疲劳。为了缓解这种紧张关系,提出了许多资源管理机制来处理过载和减少故障。最近的一种机制是在服务网格中断路。断路拒绝传入请求,以牺牲可用性(成功应答的请求)为代价来保护延迟,但在许多情况下,由于在高度动态的微服务环境中很难知道何时触发断路,因此两者都无法实现。我们提出了一种自适应断路机制,通过自适应控制器实现,不仅可以避免过载和减轻故障,而且可以在最大限度地提高服务吞吐量的同时保持尾部响应时间低于给定阈值。在基于Istio和Kubernetes的测试平台上,我们提出的控制器与静态断路器在各种过载场景下进行了实验比较。结果表明,我们的控制器在98%的时间内(包括冷启动)平均保持尾部响应时间低于给定阈值,可用性为70%,29%的请求电路中断。这优于静态断路器配置,后者具有63%的可用性、30%的电路中断请求和超过5%的请求超时。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Service mesh circuit breaker: From panic button to performance management tool
Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信