Failure Transparency in Stateful Dataflow Systems (Technical Report)

arXiv - CS - Programming Languages | Pub Date: 2024-07-09 | DOI: arxiv-2407.06738

Aleksey Veresov (KTH Royal Institute of Technology), Jonas Spenger (KTH Royal Institute of Technology), Paris Carbone (KTH Royal Institute of Technology; RISE Research Institutes of Sweden), Philipp Haller (KTH Royal Institute of Technology)



Abstract

Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure-transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations. Finally, we provide a formal proof of failure transparency for the implementation model; i.e., we prove that the failure-free model correctly abstracts from the failure-related details of the implementation model. We also show liveness of the implementation model under a fair execution assumption. These results are a first step towards a verified stack for stateful dataflow systems.
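To make the idea of relating a failure-prone execution to a failure-free one more concrete, here is a minimal Python sketch. It is a toy and not the paper's formal model or Flink's actual mechanism: it assumes a single stateful operator that keeps a running sum over a replayable input, with hypothetical names (`Config`, `step`, `run`, `failure_free`, `explained_by`). Observational explainability is approximated here, as an assumption, by a simple prefix check on committed outputs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Config:
    """Toy configuration of one stateful operator (hypothetical, not the paper's model)."""
    state: int = 0                                       # running sum held by the operator
    offset: int = 0                                      # index of the next input event
    pending: List[int] = field(default_factory=list)     # outputs since the last checkpoint
    committed: List[int] = field(default_factory=list)   # outputs visible to an observer
    snapshot: Tuple[int, int] = (0, 0)                   # (state, offset) at the last checkpoint


def step(cfg: Config, inputs: List[int], action: str) -> Config:
    """Take one small step of the toy implementation model."""
    if action == "process" and cfg.offset < len(inputs):
        new_state = cfg.state + inputs[cfg.offset]
        return Config(new_state, cfg.offset + 1, cfg.pending + [new_state],
                      cfg.committed, cfg.snapshot)
    if action == "checkpoint":
        # Snapshot state and input offset, and make pending outputs observable.
        return Config(cfg.state, cfg.offset, [], cfg.committed + cfg.pending,
                      (cfg.state, cfg.offset))
    if action == "fail":
        # Roll back to the last snapshot and discard uncommitted outputs.
        state, offset = cfg.snapshot
        return Config(state, offset, [], cfg.committed, cfg.snapshot)
    return cfg  # action not enabled in this configuration: no change


def run(inputs: List[int], actions: List[str]) -> List[int]:
    """Execute a sequence of actions and return the observable (committed) output."""
    cfg = Config()
    for action in actions:
        cfg = step(cfg, inputs, action)
    return cfg.committed


def failure_free(inputs: List[int]) -> List[int]:
    """Failure-free reference model: emit the running sum after each event."""
    out, total = [], 0
    for x in inputs:
        total += x
        out.append(total)
    return out


def explained_by(observed: List[int], reference: List[int]) -> bool:
    """Toy stand-in for explainability: observed output is a prefix of the reference."""
    return observed == reference[:len(observed)]


if __name__ == "__main__":
    events = [1, 2, 3]
    schedule = ["process", "checkpoint", "process", "fail",
                "process", "process", "checkpoint"]
    observed = run(events, schedule)
    print(observed, explained_by(observed, failure_free(events)))
```

In this sketch the failure between the two checkpoints leaves no trace in the committed output, which is the intuition behind hiding failure handling from the user. The paper's definition relates executions of formal small-step models and is more general than this prefix check.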


