{"title":"A short counterexample property for safety and liveness verification of fault-tolerant distributed algorithms","authors":"I. Konnov, M. Lazic, H. Veith, Josef Widder","doi":"10.1145/3009837.3009860","DOIUrl":null,"url":null,"abstract":"Distributed algorithms have many mission-critical applications ranging from embedded systems and replicated databases to cloud computing. Due to asynchronous communication, process faults, or network failures, these algorithms are difficult to design and verify. Many algorithms achieve fault tolerance by using threshold guards that, for instance, ensure that a process waits until it has received an acknowledgment from a majority of its peers. Consequently, domain-specific languages for fault-tolerant distributed systems offer language support for threshold guards. We introduce an automated method for model checking of safety and liveness of threshold-guarded distributed algorithms in systems where the number of processes and the fraction of faulty processes are parameters. Our method is based on a short counterexample property: if a distributed algorithm violates a temporal specification (in a fragment of LTL), then there is a counterexample whose length is bounded and independent of the parameters. We prove this property by (i) characterizing executions depending on the structure of the temporal formula, and (ii) using commutativity of transitions to accelerate and shorten executions. We extended the ByMC toolset (Byzantine Model Checker) with our technique, and verified liveness and safety of 10 prominent fault-tolerant distributed algorithms, most of which were out of reach for existing techniques.","PeriodicalId":20657,"journal":{"name":"Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"66","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3009837.3009860","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 66
Abstract
Distributed algorithms have many mission-critical applications ranging from embedded systems and replicated databases to cloud computing. Due to asynchronous communication, process faults, or network failures, these algorithms are difficult to design and verify. Many algorithms achieve fault tolerance by using threshold guards that, for instance, ensure that a process waits until it has received an acknowledgment from a majority of its peers. Consequently, domain-specific languages for fault-tolerant distributed systems offer language support for threshold guards. We introduce an automated method for model checking of safety and liveness of threshold-guarded distributed algorithms in systems where the number of processes and the fraction of faulty processes are parameters. Our method is based on a short counterexample property: if a distributed algorithm violates a temporal specification (in a fragment of LTL), then there is a counterexample whose length is bounded and independent of the parameters. We prove this property by (i) characterizing executions depending on the structure of the temporal formula, and (ii) using commutativity of transitions to accelerate and shorten executions. We extended the ByMC toolset (Byzantine Model Checker) with our technique, and verified liveness and safety of 10 prominent fault-tolerant distributed algorithms, most of which were out of reach for existing techniques.
分布式算法有许多关键任务应用,从嵌入式系统和复制数据库到云计算。由于异步通信、进程故障或网络故障,这些算法难以设计和验证。许多算法通过使用阈值保护来实现容错,例如,确保进程等待,直到它收到来自大多数对等节点的确认。因此,用于容错分布式系统的领域特定语言为阈值保护提供了语言支持。在以进程数量和故障进程比例为参数的系统中,我们介绍了一种用于阈值保护分布式算法的安全性和活动性模型检查的自动化方法。我们的方法基于一个简短的反例属性:如果一个分布式算法违反了时间规范(在LTL的片段中),那么就存在一个反例,其长度是有界的,与参数无关。我们通过(i)根据时间公式的结构来表征执行,以及(ii)使用转换的交换性来加速和缩短执行来证明这一性质。我们用我们的技术扩展了ByMC工具集(Byzantine Model Checker),并验证了10个突出的容错分布式算法的活动性和安全性,其中大多数是现有技术无法达到的。