Aleksey Veresov (KTH Royal Institute of Technology), Jonas Spenger (KTH Royal Institute of Technology), Paris Carbone (KTH Royal Institute of Technology; RISE Research Institutes of Sweden), Philipp Haller (KTH Royal Institute of Technology)
arXiv - CS - Programming Languages, 2024-07-09. DOI: arxiv-2407.06738 (https://doi.org/arxiv-2407.06738)
Failure Transparency in Stateful Dataflow Systems (Technical Report)
Failure transparency enables users to reason about distributed systems at a
higher level of abstraction, where complex failure-handling logic is hidden.
This is especially true for stateful dataflow systems, which are the backbone
of many cloud applications. In particular, this paper focuses on proving
failure transparency in Apache Flink, a popular stateful dataflow system. Even
though failure transparency is a critical aspect of Apache Flink, to date it
has not been formally proven. Showing that the failure transparency mechanism
is correct, however, is challenging due to the complexity of the mechanism
itself. Nevertheless, this complexity can be effectively hidden behind a
failure-transparent programming interface. To show that Apache Flink is
failure-transparent, we model it in small-step operational semantics. Next, we provide
a novel definition of failure transparency based on observational
explainability, a concept which relates executions according to their
observations. Finally, we provide a formal proof of failure transparency for
the implementation model; i.e., we prove that the failure-free model correctly
abstracts from the failure-related details of the implementation model. We also
show liveness of the implementation model under a fair execution assumption.
These results are a first step towards a verified stack for stateful dataflow
systems.
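
As an informal illustration of what failure transparency asks for, the following is a minimal Python sketch, not the paper's formal model and not Apache Flink's actual mechanism. It assumes a toy running-sum operator, a replayable input list, and periodic snapshots; all names (`run_failure_free`, `run_with_failures`, `checkpoint_every`, `fail_at_steps`) are hypothetical. The point is only that an observer of the output cannot tell the failing execution apart from the failure-free one.

```python
def run_failure_free(events):
    """Toy failure-free model: emit the running sum after each event."""
    total, out = 0, []
    for e in events:
        total += e
        out.append(total)
    return out


def run_with_failures(events, checkpoint_every, fail_at_steps):
    """Toy implementation model: periodic snapshots plus rollback on failure.

    fail_at_steps is a finite set of global step numbers at which a crash
    is injected (an assumption made for this sketch).
    """
    total, offset, out = 0, 0, []
    # Snapshot holds (operator state, input offset, committed output length).
    snapshot = (0, 0, 0)
    step = 0
    while offset < len(events):
        step += 1
        if step in fail_at_steps:
            # Crash: restore the last snapshot and discard uncommitted output.
            total, offset, committed = snapshot
            del out[committed:]
            continue
        # Normal processing step: consume one event and emit the new sum.
        total += events[offset]
        offset += 1
        out.append(total)
        if offset % checkpoint_every == 0:
            snapshot = (total, offset, len(out))
    return out


events = [3, 1, 4, 1, 5, 9, 2, 6]
expected = run_failure_free(events)
observed = run_with_failures(events, checkpoint_every=3, fail_at_steps={2, 5, 6, 11})
print(observed == expected)
```

Because the set of injected failures is finite, the loop eventually makes progress and terminates, which loosely mirrors the role of the fair execution assumption in the liveness result.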