{"title":"Gumshoe: Diagnosing Performance Problems in Replicated File-Systems","authors":"Soila Kavulya, R. Gandhi, P. Narasimhan","doi":"10.1109/SRDS.2008.35","DOIUrl":"https://doi.org/10.1109/SRDS.2008.35","url":null,"abstract":"Replicated file-systems can experience degraded performance that might not be adequately handled by the underlying fault-tolerant protocols. We describe the design and implementation of Gumshoe, a system that aims to diagnose performance problems in replicated file-systems. Gumshoe periodically gathers OS and protocol metrics and then analyzes these metrics to automatically localize the performance problem to the culprit node(s). We describe our results and experiences with problem diagnosis in two replicated file-systems (replicated-CoreFS and BFS) using two file-system benchmarks (Postmark and IOzone).","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123242802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-Tolerant Coverage Planning in Wireless Networks","authors":"S. Ivanov, E. Nett","doi":"10.1109/SRDS.2008.14","DOIUrl":"https://doi.org/10.1109/SRDS.2008.14","url":null,"abstract":"Typically wireless networks coverage is planned with static redundancy to compensate temporal variations in the environment. As a result, the service still is delivered but the network coverage could have entered a critical state, meaning that further changes in the environment may lead to service failure. Service failures have to be explicitly notified by the applications. Therefore, in this paper we propose a methodology for fault-tolerant coverage planning. The idea is detecting the critical state and removing it by on-line system reconfiguration, and restoration of the original static redundancy. Even in case of a failure the system automatically generates a new configuration to restore the service, leading to shorter repair times. We describe how this approach can be applied to wireless mesh networks, often used in industrial applications like manufacturing, automation and logistics. The evaluation results show that the underlying model used for error detection and system recovery is accurate enough to correctly identify the system state.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123754472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Damián Serrano, M. Patiño-Martínez, R. Jiménez-Peris, Bettina Kemme
{"title":"An Autonomic Approach for Replication of Internet-based Services","authors":"Damián Serrano, M. Patiño-Martínez, R. Jiménez-Peris, Bettina Kemme","doi":"10.1109/SRDS.2008.22","DOIUrl":"https://doi.org/10.1109/SRDS.2008.22","url":null,"abstract":"As more and more applications are deployed as Internet-based services, they have to be available anytime anywhere in a seamless manner. This requires the underlying infrastructure to provide scalability, fault tolerance and fast response times. While replicating the services and the data they access across sites that are located in different geographic regions is a promising means to achieve these requirements, data consistency is challenging if data continuously changes and queries are dynamic by nature, as is typical for e-commerce applications.Thus, current WAN replication solutions either trade performance for data consistency or are notable to scale in wide-area settings. In this paper, we present a novel approach to provide performance and consistency for Internet services. One of the main contributions is an autonomic replica placement module that places data copies only on servers close to clients that actually need them. The goal is to find the right trade-off between fast local access and the overhead of keeping data copies consistent. As data access patterns might change over time, reconfiguration is done periodically and online, i.e., allowing sites to receive new data copies or drop data copies while at the same time transaction processing continues in the system.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121962208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending Paxos/LastVoting with an Adequate Communication Layer for Wireless Ad Hoc Networks","authors":"Fatemeh Borran, R. Prakash, A. Schiper","doi":"10.1109/SRDS.2008.21","DOIUrl":"https://doi.org/10.1109/SRDS.2008.21","url":null,"abstract":"Most papers addressing consensus in wireless ad hoc networks adopt system models similar to those developed for wired networks. These models are focused towards node failures while ignoring link failures, and thus are poorly suited for wireless ad hoc networks. The recently proposed HO model does not have this drawback. The paper shows that an existing algorithm and the HO model can be used for multi-hop wireless ad hoc networks, if extended with an adequate communication layer. The description of the communication layer is augmented with simulation results that validate the feasibility of our approach and provide better understanding of the behavior of wireless environments.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133993482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gayatri Swamynathan, Ben Y. Zhao, K. Almeroth, S. Rao Jammalamadaka
{"title":"Towards Reliable Reputations for Dynamic Networked Systems","authors":"Gayatri Swamynathan, Ben Y. Zhao, K. Almeroth, S. Rao Jammalamadaka","doi":"10.1109/SRDS.2008.31","DOIUrl":"https://doi.org/10.1109/SRDS.2008.31","url":null,"abstract":"A new generation of distributed systems and applications rely on the cooperation of diverse user populations motivated by self-interest. While they can utilize \"reputation systems\" to reduce selfish behaviors that disrupt or manipulate the network for personal gain, current reputations face a key challenge in large dynamic networks: vulnerability to peer collusion. In this paper, we propose to dramatically improve the accuracy of reputation systems with the use of a statistical metric that measures the \"reliability\" of a peer's reputation taking into account collusion-like behavior. Trace-driven simulations on P2P network traffic show that our reliability metric drastically improves system performance. We also apply our metric to 18,000 randomly selected eBay user reputation profiles, and surprisingly discover numerous users with collusion-like behaviors worthy of additional investigation.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128009254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-Level Recovery Mechanisms for Context-Aware Pervasive Computing","authors":"D. Kulkarni, A. Tripathi","doi":"10.1109/SRDS.2008.13","DOIUrl":"https://doi.org/10.1109/SRDS.2008.13","url":null,"abstract":"We identify here various kinds of failure conditions and robustness issues that arise in context-aware pervasive computing applications. Such conditions are related to failures in an application's interactions with ambient services, failures in resource discovery and binding, and invalidation of context conditions during the execution of an application task. In this paper we present an exception handling model for integrating forward error recovery mechanisms in the designs of such applications. This model is integrated in a role-based framework and supported by a programming environment for construction of such applications.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131333031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Formalizing System Behavior for Evaluating a System Hang Detector","authors":"Long Wang, Z. Kalbarczyk, R. Iyer","doi":"10.1109/SRDS.2008.11","DOIUrl":"https://doi.org/10.1109/SRDS.2008.11","url":null,"abstract":"This paper presents an approach to formally verify the detection capability of a system hang detector. To achieve this goal, an abstract formal model of a typical Linux system is created to thoroughly exercise all execution scenarios that may lead to hangs. The goal is to expose cases (i.e., hang scenarios) that escape detection. Our system model abstracts the basic hardware (e.g., timer, hardware counter) and software (e.g., processes/threads) components present in the Linux system. The model enables: (i) capturing behavior of these components so as to depict execution scenarios that lead to hangs, and (ii) evaluating hang detection coverage. Explicit-state model checking is applied to reason about system behavior and uncover hang scenarios that escape detection. The results indicate that the proposed framework allows identification of corner cases of hang scenarios that escape detection and provides valuable insight to developers for enhancing detection mechanisms.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127112331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rekha Bachwani, Leszek Gryz, R. Bianchini, C. Dubnicki
{"title":"Dynamically Quantifying and Improving the Reliability of Distributed Storage Systems","authors":"Rekha Bachwani, Leszek Gryz, R. Bianchini, C. Dubnicki","doi":"10.1109/SRDS.2008.36","DOIUrl":"https://doi.org/10.1109/SRDS.2008.36","url":null,"abstract":"In this paper, we argue that the reliability of large-scale storage systems can be significantly improved by using better reliability metrics and more efficient policies for recovering from hardware failures. Specifically, we make three main contributions. First, we introduce NDS (Normalcy Deviation Score), a new metric for dynamically quantifying the reliability status of a storage system. Second, we propose MinI (Minimum Intersection), a novel recovery scheduling policy that improves reliability by efficiently reconstructing data after a hardware failure. MinI uses NDS to tradeoff reliability and performance in making its scheduling decisions. Third, we evaluate NDS and MinI for three common data-allocation schemes and a number of different parameters. Our evaluation focuses on a distributed storage system based on erasure codes. We find that MinI improves reliability significantly, as compared to conventional policies.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127519604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Absolute-Relative Risk Assessment Methodology Approach to Current Safety Critical Systems and its Application to the ADS-B based Air Traffic Control System","authors":"L. Vismari, J. Camargo","doi":"10.1109/SRDS.2008.24","DOIUrl":"https://doi.org/10.1109/SRDS.2008.24","url":null,"abstract":"This work presents a risk assessment methodology, preliminary proposed in [1], which is the fusion of the \"absolute\" and the \"relative\" risk assessment methods adopted by the International Civil Aviation Organization. The proposed methodology uses the Fluid Stochastic Petri Net (FSPN) as modeling formalism, and compares the safety metrics estimated from the simulation of both the proposed and the legacy system models. It was applied to assess the safety properties of a new air traffic surveillance concept, named \"automatic dependent surveillance - broadcasting\" (ADS-B). As conclusions, the proposed methodology assured to assess the safety properties of systems based on the current safety critical system paradigm - especially concerning the air transportation system. Besides, the FSPN formalism provided important modeling capabilities and discrete event simulation allowing estimating the desired safety metrics. Finally, the ADS-B (proposed system) has significantly reduced the risks of separation losses between aircrafts if compared to the usual surveillance radar systems (legacy system) in air traffic control (ATC) environment.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129257975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. M. Bernabé-Gisbert, Vaide Zuikeviciute, F. D. Muñoz-Escoí, F. Pedone
{"title":"A Probabilistic Analysis of Snapshot Isolation with Partial Replication","authors":"J. M. Bernabé-Gisbert, Vaide Zuikeviciute, F. D. Muñoz-Escoí, F. Pedone","doi":"10.1109/SRDS.2008.10","DOIUrl":"https://doi.org/10.1109/SRDS.2008.10","url":null,"abstract":"Snapshot isolation has received a considerable amount of attention in the context of full database replication. Such popularity is mainly because read-only transactions executing under snapshot isolation are never blocked or aborted. In partial replication, where each replica holds only a part of the database, transactions may require access to remote databases. Each remote read operation of the transaction must execute in a consistent global database snapshot as the local operations; if such a snapshot is not available, the transaction must be aborted. In this paper we are interested in the effects of distributed transactions on the abort rate of partially replicated snapshot isolation systems. We present a simple probabilistic analysis of transaction abort rates for two different concurrency control mechanisms: lock- and version-based. The former models the behavior of a replication protocol providing one-copy-serializability; the latter models snapshot isolation. Our analysis reveals that in the version-based system the execution abort rate decreases exponentially as the number of data versions available increases. As a consequence, in all cases considered, two versions of each data item were sufficient to eliminate aborts due to distributed transactions.","PeriodicalId":397103,"journal":{"name":"2008 Symposium on Reliable Distributed Systems","volume":"242 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122128761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}