Proceedings of the 27th ACM Symposium on Operating Systems Principles: Latest Publications

CrashTuner
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359645
Jie Lu, Chen Liu, Lian Li, Xiaobing Feng, Feng Tan, Jun Yang, Liang You
{"title":"CrashTuner","authors":"Jie Lu, Chen Liu, Lian Li, Xiaobing Feng, Feng Tan, Jun Yang, Liang You","doi":"10.1145/3341301.3359645","DOIUrl":"https://doi.org/10.1145/3341301.3359645","url":null,"abstract":"Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult to detect crash-recovery bugs since these bugs can only be exposed when nodes crash under special timing conditions. This paper presents CrashTuner, a novel fault-injection testing approach to combat crash-recovery bugs. The novelty of CrashTuner lies in how we identify fault-injection points (crash points) that are likely to expose errors. We observe that if a node crashes while accessing meta-info variables, i.e., variables referencing high-level system state information (e.g., an instance of node or task), it often triggers crash-recovery bugs. Hence, we identify crash points by automatically inferring meta-info variables via a log-based static program analysis. Our approach is automatic and no manual specification is required. We have applied CrashTuner to five representative distributed systems: Hadoop2/Yarn, HBase, HDFS, ZooKeeper, and Cassandra. CrashTuner can finish testing each system in 17.39 hours, and reports 21 new bugs that have never been found before. All new bugs are confirmed by the original developers and 16 of them have already been fixed (14 with our patches). These new bugs can cause severe damages such as cluster down or start-up failures.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122533843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Verifying concurrent, crash-safe systems with Perennial
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359632
Tej Chajed, Joseph Tassarotti, M. Kaashoek, N. Zeldovich
{"title":"Verifying concurrent, crash-safe systems with Perennial","authors":"Tej Chajed, Joseph Tassarotti, M. Kaashoek, N. Zeldovich","doi":"10.1145/3341301.3359632","DOIUrl":"https://doi.org/10.1145/3341301.3359632","url":null,"abstract":"This paper introduces Perennial, a framework for verifying concurrent, crash-safe systems. Perennial extends the Iris concurrency framework with three techniques to enable crash-safety reasoning: recovery leases, recovery helping, and versioned memory. To ease development and deployment of applications, Perennial provides Goose, a subset of Go and a translator from that subset to a model in Perennial with support for reasoning about Go threads, data structures, and file-system primitives. We implemented and verified a crash-safe, concurrent mail server using Perennial and Goose that achieves speedup on multiple cores. Both Perennial and Iris use the Coq proof assistant, and the mail server and the framework's proofs are machine checked.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122645201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
Fast and secure global payments with Stellar
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359636
Marta Lokhava, Giuliano Losa, David Mazières, Graydon Hoare, N. Barry, E. Gafni, Jonathan Jove, Rafał Malinowsky, Jed McCaleb
{"title":"Fast and secure global payments with Stellar","authors":"Marta Lokhava, Giuliano Losa, David Mazières, Graydon Hoare, N. Barry, E. Gafni, Jonathan Jove, Rafał Malinowsky, Jed McCaleb","doi":"10.1145/3341301.3359636","DOIUrl":"https://doi.org/10.1145/3341301.3359636","url":null,"abstract":"International payments are slow and expensive, in part because of multi-hop payment routing through heterogeneous banking systems. Stellar is a new global payment network that can directly transfer digital money anywhere in the world in seconds. The key innovation is a secure transaction mechanism across untrusted intermediaries, using a new Byzantine agreement protocol called SCP. With SCP, each institution specifies other institutions with which to remain in agreement; through the global interconnectedness of the financial system, the whole network then agrees on atomic transactions spanning arbitrary institutions, with no solvency or exchange-rate risk from intermediary asset issuers or market makers. We present SCP's model, protocol, and formal verification; describe the Stellar payment network; and finally evaluate Stellar empirically through benchmarks and our experience with several years of production use.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126166990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 89
The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359650
Yongle Zhang, Kirk Rodrigues, Yu Luo, M. Stumm, Ding Yuan
{"title":"The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure","authors":"Yongle Zhang, Kirk Rodrigues, Yu Luo, M. Stumm, Ding Yuan","doi":"10.1145/3341301.3359650","DOIUrl":"https://doi.org/10.1145/3341301.3359650","url":null,"abstract":"The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction where the failure execution deviates from the non-failure execution that has the longest instruction sequence prefix in common with that of the failure execution. Thus, root cause analysis is transformed into a principled search problem to identify the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. It is, in most cases, capable of pinpointing the root cause of a failure in a distributed system, in a fully automated way. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution in order to locate the root cause. By evaluating Kairux on some of the most complex, real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's respective root cause.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114906507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Using concurrent relational logic with helpers for verifying the AtomFS file system
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359644
Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui Gu, Haibo Chen
{"title":"Using concurrent relational logic with helpers for verifying the AtomFS file system","authors":"Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui Gu, Haibo Chen","doi":"10.1145/3341301.3359644","DOIUrl":"https://doi.org/10.1145/3341301.3359644","url":null,"abstract":"Concurrent file systems are pervasive but hard to correctly implement and formally verify due to nondeterministic interleavings. This paper presents AtomFS, the first formally-verified, fine-grained, concurrent file system, which provides linearizable interfaces to applications. The standard way to prove linearizability requires modeling linearization point of each operation---the moment when its effect becomes visible atomically to other threads. We observe that path inter-dependency, where one operation (like rename) breaks the path integrity of other operations, makes the linearization point external and thus poses a significant challenge to prove linearizability. To overcome the above challenge, this paper presents Concurrent Relational Logic with Helpers (CRL-H), a framework for building verified concurrent file systems. CRL-H is made powerful through two key contributions: (1) extending prior approaches using fixed linearization points with a helper mechanism where one operation of the thread can logically help other threads linearize their operations; (2) combining relational specifications and rely/guarantee conditions for relational and compositional reasoning. We have successfully applied CRL-H to verify the linearizability of AtomFS directly in C code. All the proofs are mechanized in Coq. Evaluations show that AtomFS speeds up file system workloads by utilizing fine-grained, multicore concurrency.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130150556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Risk based planning of network changes in evolving data centers
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359664
Omid Alipourfard, Jiaqi Gao, Jérémie Koenig, Chris Harshaw, Amin Vahdat, Minlan Yu
{"title":"Risk based planning of network changes in evolving data centers","authors":"Omid Alipourfard, Jiaqi Gao, Jérémie Koenig, Chris Harshaw, Amin Vahdat, Minlan Yu","doi":"10.1145/3341301.3359664","DOIUrl":"https://doi.org/10.1145/3341301.3359664","url":null,"abstract":"Data center networks evolve as they serve customer traffic. When applying network changes, operators risk impacting customer traffic because the network operates at reduced capacity and is more vulnerable to failures and traffic variations. The impact on customer traffic ultimately translates to operator cost (e.g., refunds to customers). However, planning a network change while minimizing the risks is challenging as we need to adapt to a variety of traffic dynamics and cost functions while scaling to large networks and large changes. Today, operators often use plans that maximize the residual capacity (MRC), which often incurs a high cost under different traffic dynamics. Instead, we propose Janus, which searches the large planning space by leveraging the high degree of symmetry in data center networks. Our evaluation on large Clos networks and Facebook traffic traces shows that Janus generates plans in real-time only needing 33~71% of the cost of MRC planners while adapting to a variety of settings.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123512733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
An analysis of performance evolution of Linux's core operations
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359640
Xiang Ren, Kirk Rodrigues, Luyuan Chen, Juan Camilo Vega, M. Stumm, Ding Yuan
{"title":"An analysis of performance evolution of Linux's core operations","authors":"Xiang Ren, Kirk Rodrigues, Luyuan Chen, Juan Camilo Vega, M. Stumm, Ding Yuan","doi":"10.1145/3341301.3359640","DOIUrl":"https://doi.org/10.1145/3341301.3359640","url":null,"abstract":"This paper presents an analysis of how Linux's performance has evolved over the past seven years. Unlike recent works that focus on OS performance in terms of scalability or service of a particular workload, this study goes back to basics: the latency of core kernel operations (e.g., system calls, context switching, etc.). To our surprise, the study shows that the performance of many core operations has worsened or fluctuated significantly over the years. For example, the select system call is 100% slower than it was just two years ago. An in-depth analysis shows that over the past seven years, core kernel subsystems have been forced to accommodate an increasing number of security enhancements and new features. These additions steadily add overhead to core kernel operations but also frequently introduce extreme slowdowns of more than 100%. In addition, simple misconfigurations have also severely impacted kernel performance. Overall, we find most of the slowdowns can be attributed to 11 changes. Some forms of slowdown are avoidable with more proactive engineering. We show that it is possible to patch two security enhancements (from the 11 changes) to eliminate most of their overheads. In fact, several features have been introduced to the kernel unoptimized or insufficiently tested and then improved or disabled long after their release. Our findings also highlight both the feasibility and importance for Linux users to actively configure their systems to achieve an optimal balance between performance, functionality, and security: we discover that 8 out of the 11 changes can be avoided by reconfiguring the kernel, and the other 3 can be disabled through simple patches. By disabling the 11 changes with the goal of optimizing performance, we speed up Redis, Apache, and Nginx benchmark workloads by as much as 56%, 33%, and 34%, respectively.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124520670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
Nexus: a GPU cluster engine for accelerating DNN-based video analysis
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359658
Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, A. Krishnamurthy, Ravi Sundaram
{"title":"Nexus: a GPU cluster engine for accelerating DNN-based video analysis","authors":"Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, A. Krishnamurthy, Ravi Sundaram","doi":"10.1145/3341301.3359658","DOIUrl":"https://doi.org/10.1145/3341301.3359658","url":null,"abstract":"We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. In large-scale case studies on 16 GPUs, when required to stay within latency constraints at least 99% of the time, Nexus can process requests at rates 1.8-12.7X higher than state of the art systems can. A long-running multi-application deployment stays within 84% of optimal utilization and, on a 100-GPU cluster, violates latency SLOs on 0.27% of requests.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125210407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 151
File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359656
Abutalib Aghayev, S. Weil, Michael Kuchnik, M. Nelson, G. Ganger, George Amvrosiadis
{"title":"File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution","authors":"Abutalib Aghayev, S. Weil, Michael Kuchnik, M. Nelson, G. Ganger, George Amvrosiadis","doi":"10.1145/3341301.3359656","DOIUrl":"https://doi.org/10.1145/3341301.3359656","url":null,"abstract":"For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph's experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware is painstakingly slow. Ceph addressed these issues with BlueStore, a new back-end designed to run directly on raw storage devices. In only two years since its inception, BlueStore outperformed previous established backends and is adopted by 70% of users in production. By running in user space and fully controlling the I/O stack, it has enabled space-efficient metadata and data checksums, fast overwrites of erasure-coded data, inline compression, decreased performance variability, and avoided a series of performance pitfalls of local file systems. Finally, it makes the adoption of backwards-incompatible storage hardware possible, an important trait in a changing storage landscape that is learning to embrace hardware diversity.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125497174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 66
Niijima
Proceedings of the 27th ACM Symposium on Operating Systems Principles. Pub Date: 2019-10-27. DOI: 10.1145/3341301.3359649
Guoqi Xu, Margus Veanes, M. Barnett, Madan Musuvathi, Todd Mytkowicz, Benjamin G. Zorn, Huan He, Haibo Lin
{"title":"Niijima","authors":"Guoqi Xu, Margus Veanes, M. Barnett, Madan Musuvathi, Todd Mytkowicz, Benjamin G. Zorn, Huan He, Haibo Lin","doi":"10.1145/3341301.3359649","DOIUrl":"https://doi.org/10.1145/3341301.3359649","url":null,"abstract":"Multilingual data-parallel pipelines, such as Microsoft's Scope and Apache Spark, are widely used in real-world analytical tasks. While the involvement of multiple languages (often including both managed and native languages) provides much convenience in data manipulation and transformation, it comes at a performance cost --- managed languages need a managed runtime, incurring much overhead. In addition, each switch from a managed to a native runtime (and vice versa) requires marshalling or unmarshalling of an ocean of data objects, taking a large fraction of the execution time. This paper presents Niijima, an optimizing compiler for Microsoft's Scope/Cosmos, which can consolidate C#-based user-defined operators (UDOs) across SQL statements, thereby reducing the number of dataflow vertices that require the managed runtime, and thus the amount of C# computations and the data marshalling cost. We demonstrate that Niijima has reduced job latency by an average of 24% and up to 3.3x, on a series of production jobs.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121024346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1