F. Nikolaidis, A. Chazapis, M. Marazakis, A. Bilas
{"title":"Frisbee","authors":"F. Nikolaidis, A. Chazapis, M. Marazakis, A. Bilas","doi":"10.1145/3447851.3458738","DOIUrl":null,"url":null,"abstract":"With failures being unavoidable, a system's ability to recover from failures quickly is a critical factor in the overall availability of the system. Although many systems exhibit self-healing properties, their behavior in the presence of failures is poorly understood. This is primarily due to the shortcomings of existing benchmarks, which cannot generate failures. For a more accurate systems evaluation, we argue that it is essential to create new suites that treat failures as first-class citizens. We present Frisbee, a benchmark suite and evaluation methodology for comparing the recovery behavior of highly available systems. Frisbee is built for the Kubernetes environment, leveraging several valuable tools in its stack, including Chaos tools for fault injection, Prometheus for distributed monitoring, and Grafana for visualization. We discuss a set of design requirements and present an initial prototype that makes faultloads as easy to run and characterize as traditional performance workloads. Furthermore, we define a core set of failure patterns against which systems can be compared.","PeriodicalId":166666,"journal":{"name":"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447851.3458738","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
Abstract
With failures being unavoidable, a system's ability to recover from failures quickly is a critical factor in the overall availability of the system. Although many systems exhibit self-healing properties, their behavior in the presence of failures is poorly understood. This is primarily due to the shortcomings of existing benchmarks, which cannot generate failures. For a more accurate systems evaluation, we argue that it is essential to create new suites that treat failures as first-class citizens. We present Frisbee, a benchmark suite and evaluation methodology for comparing the recovery behavior of highly available systems. Frisbee is built for the Kubernetes environment, leveraging several valuable tools in its stack, including Chaos tools for fault injection, Prometheus for distributed monitoring, and Grafana for visualization. We discuss a set of design requirements and present an initial prototype that makes faultloads as easy to run and characterize as traditional performance workloads. Furthermore, we define a core set of failure patterns against which systems can be compared.