Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems: Latest Articles

CARE: Infusing Causal Aware Thinking to Root Cause Analysis in Cloud System

Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems Pub Date : 2021-04-26 DOI: 10.1145/3447851.3458737

Yong Xu, Xu Zhang, Chuan Luo, Si Qin, Rohitashwa Pandey, Chao Du, Qingwei Lin, Yingnong Dang, Andrew Zhou

Citations: 2

Examining Raft's behaviour during partial network failures

Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems Pub Date : 2021-04-26 DOI: 10.1145/3447851.3458739

C. Jensen, H. Howard, R. Mortier

Citations: 4

Service mesh circuit breaker: From panic button to performance management tool

Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems Pub Date : 2021-04-26 DOI: 10.1145/3447851.3458740

Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson

Abstract: Site Reliability Engineers are at the center of two tensions. On the one hand, they need to respond to alerts within a short time to restore a non-functional system. On the other hand, short response times are disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms have been proposed to handle overload and mitigate faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios it achieves neither, because it is difficult to know when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigates failure, but also keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average, with an availability of 70% and 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features 63% availability, 30% circuit-broken requests, and more than 5% of requests timing out.

Citations: 11
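
The abstract of this entry describes a controller that adjusts circuit breaking so that tail response time stays below a threshold while throughput is maximized. The listing does not say how the authors' controller works, so the following is only a minimal sketch of the general idea under an assumed additive-increase/multiplicative-decrease (AIMD) rule. The class `AdaptiveBreaker`, its fields, and the halving/plus-one steps are all hypothetical and are not taken from the paper, Istio, or Kubernetes.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveBreaker:
    """Hypothetical AIMD-style admission controller (not the paper's design)."""
    target_p99_ms: float   # tail-latency threshold to stay under
    limit: int = 10        # current cap on in-flight requests
    min_limit: int = 1     # never shut the service off completely
    max_limit: int = 1000  # upper bound on the cap
    in_flight: int = 0     # requests currently being served

    def try_acquire(self) -> bool:
        """Admit a request if under the cap; otherwise reject (circuit-break) it."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight = max(0, self.in_flight - 1)

    def update(self, observed_p99_ms: float) -> int:
        """Adjust the cap once per control interval from the observed tail latency."""
        if observed_p99_ms > self.target_p99_ms:
            # Over the threshold: back off multiplicatively (assumed factor 1/2).
            self.limit = max(self.min_limit, self.limit // 2)
        else:
            # Under the threshold: probe for more throughput additively.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit


if __name__ == "__main__":
    breaker = AdaptiveBreaker(target_p99_ms=200.0, limit=8)
    print(breaker.update(350.0))  # latency too high, cap halves: prints 4
    print(breaker.update(120.0))  # latency fine, cap grows by one: prints 5
```

A static circuit breaker corresponds to never calling `update`, so the cap stays at its configured value regardless of load. The adaptive variant trades a small amount of bookkeeping for a cap that follows the measured tail latency.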

Frisbee

Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems Pub Date : 1900-01-01 DOI: 10.1145/3447851.3458738

F. Nikolaidis, A. Chazapis, M. Marazakis, A. Bilas

Citations: 4