Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems: latest papers

CARE: Infusing Causal Aware Thinking to Root Cause Analysis in Cloud System
Yong Xu, Xu Zhang, Chuan Luo, Si Qin, Rohitashwa Pandey, Chao Du, Qingwei Lin, Yingnong Dang, Andrew Zhou
Abstract: With millions of customers accessing online services all over the world, ensuring high service availability is critical for cloud systems. In recent years, advances in data mining and machine learning have prompted extensive study of data-driven solutions that detect anomalous system behavior and diagnose its root cause. However, without any surveillance of the data generation process, the existing passive data-driven approach may produce biased analysis results, induced by observed and unobserved confounding factors in a dynamic and heterogeneous system, and thus harm service availability through misleading mitigation actions. In this paper, we propose to infuse causal thinking into the current data-driven solutions for cloud systems. We developed CARE, a causal-aware root cause discovery engine, which uses randomized controlled trials to proactively generate less ambiguous data for further analysis. A case study shows the application of CARE to Microsoft Office365.
DOI: 10.1145/3447851.3458737 (https://doi.org/10.1145/3447851.3458737)
Published: 2021-04-26
Citations: 2
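The CARE abstract contrasts passive observation with proactively generated trial data. As a rough illustration of that idea only, and not the CARE engine itself, the sketch below randomly assigns units to a treatment and a control group and compares their mean outcomes. The names `assign_groups` and `effect_estimate`, the unit labels, and the outcome values are hypothetical.

```python
import random
from statistics import mean


def assign_groups(units, seed=0):
    """Randomly split units into treatment and control groups.

    Random assignment is what breaks the link between the treatment and
    any confounder, observed or not.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def effect_estimate(outcomes, treatment, control):
    """Difference in mean outcome between the treatment and control groups."""
    treated = mean(outcomes[u] for u in treatment)
    untreated = mean(outcomes[u] for u in control)
    return treated - untreated


# Hypothetical example: ten nodes, and a made-up error rate per node.
units = [f"node-{i}" for i in range(10)]
treatment, control = assign_groups(units)
outcomes = {u: (0.30 if u in treatment else 0.10) for u in units}
print(round(effect_estimate(outcomes, treatment, control), 2))
```

A real trial would also need a significance test and a safe way to apply the treatment in production; both are outside this sketch.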
Examining Raft's behaviour during partial network failures
C. Jensen, H. Howard, R. Mortier
Abstract: State machine replication protocols such as Raft are widely used to build highly available, strongly consistent services that maintain liveness even if a minority of servers crash. As these systems are implemented and optimised for production, they accumulate many divergences from the original specification. These divergences are poorly documented, leaving operators with an incomplete model of the system's characteristics, especially during failures. In this paper, we look at one such Raft model used to explain the November Cloudflare outage and show that etcd's behaviour during the same failure differs. We go on to show the specific optimisations in etcd that cause this difference and present a more complete model of the outage based on etcd's behaviour in an emulated deployment using reckon. Finally, we highlight the upcoming PreVote optimisation in etcd, which might have prevented the outage from happening in the first place.
DOI: 10.1145/3447851.3458739 (https://doi.org/10.1145/3447851.3458739)
Published: 2021-04-26
Citations: 4
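The Raft abstract closes by pointing at the PreVote optimisation. A minimal sketch of the general PreVote rule, as described in the Raft literature, might look like the following; it is a simplified model, not etcd's code, and the `Node` fields and function names are assumptions made for illustration. A node only starts a real election, and so only bumps its term, if a majority would be willing to vote for it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    term: int
    last_log_term: int
    last_log_index: int
    # True if this node received a leader heartbeat within its election timeout.
    heard_from_leader_recently: bool


def log_up_to_date(candidate: Node, voter: Node) -> bool:
    """Raft's log comparison: later last term wins, then longer log."""
    return (candidate.last_log_term, candidate.last_log_index) >= (
        voter.last_log_term,
        voter.last_log_index,
    )


def grants_pre_vote(voter: Node, candidate: Node) -> bool:
    """Would `voter` grant a pre-vote for the term `candidate.term + 1`?"""
    if voter.heard_from_leader_recently:
        return False  # voter still sees a live leader, so no election is needed
    if candidate.term + 1 < voter.term:
        return False  # the proposed term is already stale
    return log_up_to_date(candidate, voter)


def may_start_election(candidate: Node, peers: list) -> bool:
    """True only if a majority (counting the candidate) would vote for it."""
    votes = 1 + sum(grants_pre_vote(p, candidate) for p in peers)
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2
```

In this model, a node that is cut off from the leader but whose peers still receive heartbeats fails the pre-vote and leaves its term unchanged, so it cannot disrupt the healthy leader when it reconnects.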
Service mesh circuit breaker: From panic button to performance management tool
Mohammad Reza Saleh Sedghpour, C. Klein, Johan Tordsson
Abstract: Site Reliability Engineers are at the center of a tension: on one hand, they need to respond to alerts within a short time to restore a non-functional system. On the other hand, short response times are disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms have been proposed to handle overload and mitigate faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios it achieves neither, because it is difficult to know when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigates failures, but also keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller keeps the tail response time below the given threshold 98% of the time on average (including cold starts), with an availability of 70% and 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features 63% availability, 30% circuit-broken requests, and more than 5% of requests timing out.
DOI: 10.1145/3447851.3458740 (https://doi.org/10.1145/3447851.3458740)
Published: 2021-04-26
Citations: 11
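The circuit breaker abstract compares a static breaker with an adaptive one. The sketch below illustrates the difference under stated assumptions: the static breaker rejects requests above a fixed in-flight limit, while `adjust_limit` moves that limit with a simple additive-increase, multiplicative-decrease rule driven by tail latency. That rule is a stand-in chosen for illustration, not the controller from the paper, and none of these names come from Istio.

```python
import math
from dataclasses import dataclass


@dataclass
class StaticCircuitBreaker:
    """Rejects requests once a fixed number are already in flight."""
    max_in_flight: int
    in_flight: int = 0
    rejected: int = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1  # circuit-broken request
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        if self.in_flight > 0:
            self.in_flight -= 1


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def adjust_limit(limit, latencies_ms, target_ms, q=95.0, step=1, backoff=0.5):
    """Hypothetical AIMD rule: shrink the limit when tail latency exceeds the
    target, otherwise grow it by `step` to admit more traffic."""
    if not latencies_ms:
        return limit
    if percentile(latencies_ms, q) > target_ms:
        return max(1, int(limit * backoff))
    return limit + step
```

An adaptive breaker in this style would call `adjust_limit` once per measurement window and write the result back to `max_in_flight`; the window length and the latency target are tuning choices that this sketch leaves open.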
Frisbee
F. Nikolaidis, A. Chazapis, M. Marazakis, A. Bilas
Abstract: With failures being unavoidable, a system's ability to recover from failures quickly is a critical factor in the overall availability of the system. Although many systems exhibit self-healing properties, their behavior in the presence of failures is poorly understood. This is primarily due to the shortcomings of existing benchmarks, which cannot generate failures. For a more accurate systems evaluation, we argue that it is essential to create new suites that treat failures as first-class citizens. We present Frisbee, a benchmark suite and evaluation methodology for comparing the recovery behavior of highly available systems. Frisbee is built for the Kubernetes environment, leveraging several valuable tools in its stack, including Chaos tools for fault injection, Prometheus for distributed monitoring, and Grafana for visualization. We discuss a set of design requirements and present an initial prototype that makes faultloads as easy to run and characterize as traditional performance workloads. Furthermore, we define a core set of failure patterns against which systems can be compared.
DOI: 10.1145/3447851.3458738 (https://doi.org/10.1145/3447851.3458738)
Citations: 4