Jing Li, Rebecca J. Stones, G. Wang, Zhongwei Li, X. Liu, Kang Xiao
{"title":"Being Accurate Is Not Enough: New Metrics for Disk Failure Prediction","authors":"Jing Li, Rebecca J. Stones, G. Wang, Zhongwei Li, X. Liu, Kang Xiao","doi":"10.1109/SRDS.2016.019","DOIUrl":"https://doi.org/10.1109/SRDS.2016.019","url":null,"abstract":"Traditionally, disk failure prediction accuracy is used to evaluate disk failure prediction model. However, accuracy may not reflect their practical usage (protecting against failures, rather than only predicting failures) in cloud storage systems. In this paper, we propose two new metrics for disk failure prediction models: migration rate, which measures how much at-risk data is protected as a result of correct failure predictions, and mismigration rate, which measures how much data is migrated needlessly as a result of false failure predictions. To demonstrate their effectiveness, we compare disk failure prediction methods: (a) a classification tree (CT) model vs. a state-of-the-art recurrent neural network (RNN) model, and (b) a proposed residual life prediction model based on gradient boosted regression trees (GBRTs) vs. RNN. While prediction accuracy experiments favor the RNN model, migration rate experiments can favor the CT and GBRT models (depending on transfer rates). We conclude that prediction accuracy can be a misleading metric. Moreover, the proposed GBRT model offers a practical improvement in disk failure prediction in real-world data centers.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126956097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vincent Primault, A. Boutet, Sonia Ben Mokhtar, L. Brunie
{"title":"Adaptive Location Privacy with ALP","authors":"Vincent Primault, A. Boutet, Sonia Ben Mokhtar, L. Brunie","doi":"10.1109/SRDS.2016.044","DOIUrl":"https://doi.org/10.1109/SRDS.2016.044","url":null,"abstract":"With the increasing amount of mobility data being collected on a daily basis by location-based services (LBSs) comes a new range of threats for users, related to the over-sharing of their location information. To deal with this issue, several location privacy protection mechanisms (LPPMs) have been proposed in the past years. However, each of these mechanisms comes with different configuration parameters that have a direct impact both on the privacy guarantees offered to the users and on the resulting utility of the protected data. In this context, it can be difficult for non-expert system designers to choose the appropriate configuration to use. Moreover, these mechanisms are generally configured once for all, which results in the same configuration for every protected piece of information. However, not all users have the same behaviour, and even the behaviour of a single user is likely to change over time. To address this issue, we present in this paper ALP (which stands for Adaptive Location Privacy), a new framework enabling the dynamic configuration of LPPMs. ALP can be used in two scenarios: (1) offline, where ALP enables a system designer to choose and automatically tune the most appropriate LPPM for the protection of a given dataset, (2) online, where ALP enables the user of a crowd sensing application to protect consecutive batches of her geolocated data by automatically tuning a given LPPM to fulfil a set of privacy and utility objectives. We evaluate ALP on both scenarios with two real-life mobility datasets and two state-of-the-art LPPMs. Our experiments show that the adaptive LPPM configurations found by ALP outperform static configurations in terms of trade-off between privacy and utility.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126558733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juntao Fang, Shenggang Wan, Ping-Hsiu Huang, Xubin He, C. Xie
{"title":"Achieving High Reliability via Expediting the Repair of Critical Blocks in Replicated Storage Systems","authors":"Juntao Fang, Shenggang Wan, Ping-Hsiu Huang, Xubin He, C. Xie","doi":"10.1109/SRDS.2016.018","DOIUrl":"https://doi.org/10.1109/SRDS.2016.018","url":null,"abstract":"High reliability is critical to large data centers consisting of hundreds to thousands of storage nodes where node failures are not rare. Data replication is a typical technique deployed to achieve high reliability. When a node failure is detected, blocks with lost replicas are identified and recovered. Long timeouts are usually used for node failure detection. For blocks with one lost replica, the long timeouts can significantly reduce network traffic induced by data recovery. However, for blocks with two or more lost replicas, which can be caused by concurrent node failures that are not rare in large data centers, the long timeouts will result in a high risk of loss of these blocks. In this paper, we propose MFR to separate the identification of the blocks with two or more lost replicas from that of the blocks with one lost replica in a way that the identification of the blocks with two or more replicas can be accelerated while that of the blocks with one lost replica stays the same. Consequently, MFR can significantly improve data reliability while keeping the network traffic induced by data recovery stable. The results from our simulation and prototype implementation show that MFR improves the reliability of storage systems by a factor of up to 4.0 in terms of mean time to data loss. As blocks with two or more lost replicas are far fewer than blocks with one lost replica, the extra network traffic caused by MFR is less than 0.54% of total network traffic for data recovery.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131176416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peng Li, Jing Li, Rebecca J. Stones, G. Wang, Zhongwei Li, X. Liu
{"title":"ProCode: A Proactive Erasure Coding Scheme for Cloud Storage Systems","authors":"Peng Li, Jing Li, Rebecca J. Stones, G. Wang, Zhongwei Li, X. Liu","doi":"10.1109/SRDS.2016.039","DOIUrl":"https://doi.org/10.1109/SRDS.2016.039","url":null,"abstract":"Common distributed storage systems use data replication to improve system reliability and maintain data availability, but at the cost of disk storage. In order to lower storage costs, data may instead be stored according to erasure codes, but this results in greater network and disk traffic when data blocks are reconstructed following an erasure. These methods are also passive, i.e., they only reconstruct data after failures occur. In this paper, we present a proactive erasure coding scheme (ProCode). We monitor the health of disks via drive failure prediction and automatically adjust the replication factor of data blocks on at-risk disks to ensure data safety. In this way, we achieve fast recovery after disk failures without significantly increasing the storage overhead. ProCode is implemented as an extension to HDFS-RAID used by Facebook. Compared with replication storage and erasure coding, ProCode improves system reliability and availability. Specifically, experimental results show 2 or more orders of magnitude reduction in the average number of data loss events over a 10- year period, a 63% or greater drop in degraded read latency, and a 78% drop in recovery time.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131331125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Guerraoui, David Kozhaya, M. Oriol, Y. Pignolet
{"title":"Who's On Board?: Probabilistic Membership for Real-Time Distributed Control Systems","authors":"R. Guerraoui, David Kozhaya, M. Oriol, Y. Pignolet","doi":"10.1109/SRDS.2016.029","DOIUrl":"https://doi.org/10.1109/SRDS.2016.029","url":null,"abstract":"To increase their dependability, distributed control systems (DCSs) need to agree in real time about which hosts have crashed, i.e., they need a real-time membership service. In this paper, we prove that such a service cannot be implemented deterministically if, besides host crashes, communication can also fail. We define implementable probabilistic variants of membership properties, which constitute what we call a synchronous membership service (SYMS). We present an algorithm, ViewSnoop, that implements SYMS with high-probability. We implement, deploy and evaluate ViewSnoop analytically as well as experimentally, within an industrial DCS framework. We show that ViewSnoop significantly improves the dependability of DCSs compared to membership schemes based on classic heartbeats, at low additional cost. Moreover, ViewSnoop distinguishes, with high probability, host crashes from message losses, enabling DCSs to counteract losses better than existing approaches.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114874244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experiments with Self-Stabilizing Distributed Data Fusion","authors":"B. Ducourthial, V. Berge-Cherfaoui","doi":"10.1109/SRDS.2016.046","DOIUrl":"https://doi.org/10.1109/SRDS.2016.046","url":null,"abstract":"The Theory of Belief Functions is a formal frame-work for reasoning with uncertainty that is well suited for rep-resenting unreliable information and weak states of knowledge. In a previous work, a distributed algorithm for computing data fusion on-the-fly has been introduced, avoiding gathering the data on a single node before computation. In this paper, we present an experimental study of its properties. This algorithm is self-stabilizing and runs on unreliable message passing networks. It converges in finite time whatever is the initialization of the system and for any unknown topology. First we explain the algorithm implementation on an unreliable message passing environment and we implement a simple use-case. Then, by experimenting with this distributed application on a realistic network emulator, we show its interest for enforcing local confidence using close nodes, saving bandwidth and warning dangers. Moreover, we focus on the interesting connections between the data fusion operator and the self-stabilizing properties and we highlight the importance of the discounting.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125068731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christian Dombrowski, Sebastian Junges, J. Katoen, J. Gross
{"title":"Model-Checking Assisted Protocol Design for Ultra-reliable Low-Latency Wireless Networks","authors":"Christian Dombrowski, Sebastian Junges, J. Katoen, J. Gross","doi":"10.1109/SRDS.2016.048","DOIUrl":"https://doi.org/10.1109/SRDS.2016.048","url":null,"abstract":"Recently, the wireless networking community is getting more and more interested in novel protocol designs for safety-critical applications. These new applications come with unprecedented latency and reliability constraints which poses many open challenges. A particularly important one relates to the question how to develop such systems. Traditionally, development of wireless systems has mainly relied on simulations to identify viable architectures. However, in this case the drawbacks of simulations – in particular increasing run-times – rule out its application. Instead, in this paper we propose to use probabilistic model checking, a formal model-based verification technique, to evaluate different system variants during the design phase. Apart from allowing evaluations and therefore design iterations with much smaller periods, probabilistic model checking provides bounds on the reliability of the considered design choices. We demonstrate these salient features with respect to the novel EchoRing protocol, which is a token-based system designed for safety-critical industrial applications. Several mechanisms for dealing with a token loss are modeled and evaluated through probabilistic model checking, showing its potential as suitable evaluation tool for such novel wireless protocols. In particular, we show by probabilistic model checking that wireless token-passing systems can benefit tremendously from the considered fault-tolerant methods. The obtained performance guarantees for the different mechanisms even provide reasonable bounds for experimental results obtained from a real-world implementation.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121731355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UDS: A Novel and Flexible Scheduling Algorithm for Deterministic Multithreading","authors":"F. Hauck, Gerhard Habiger, Jörg Domaschka","doi":"10.1109/SRDS.2016.030","DOIUrl":"https://doi.org/10.1109/SRDS.2016.030","url":null,"abstract":"Active replication requires deterministic execution in each replica in order to keep them consistent. Debugging and testing need deterministic execution in order to avoid data races and \"Heisenbugs\". Beside input, multi-threading constitutes a major source of nondeterminism. Several deterministic scheduling algorithms exist that allow concurrent but deterministic executions. Yet, these algorithms seem to be very different. Some of them were even developed without knowing the others. In this paper, we present the novel and flexible Unified Deterministic Scheduling algorithm (UDS) for weakly and fully deterministic systems. Compared to existing algorithms, UDS has a broader parameter set, allowing for many configurations that can be used to adapt to a given work load. For the first time, UDS defines reconfiguration of a deterministic scheduler at run-time. Further, we informally show that existing algorithms can be imitated by a particular configuration of UDS, demonstrating its importance.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122406300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network Aware Reliability Analysis for Distributed Storage Systems","authors":"Amir Epstein, E. K. Kolodner, D. Sotnikov","doi":"10.1109/SRDS.2016.042","DOIUrl":"https://doi.org/10.1109/SRDS.2016.042","url":null,"abstract":"It is hard to measure the reliability of a large distributed storage system, since it is influenced by low probability failure events that occur over time. Nevertheless, it is critical to be able to predict reliability in order to plan, deploy and operate the system. Existing approaches suffer from unrealistic assumptions regarding network bandwidth. This paper introduces a new framework that combines simulation and an analytic model to estimate durability for large distributed cloud storage systems. Our approach is the first that takes into account network bandwidth with a focus on the cumulative effect of simultaneous failures on repair time. Using our framework we evaluate the trade-offs between durability, network and storage costs for the OpenStack Swift object store, comparing various system configurations and resiliency schemes, including replication and erasure coding. In particular, we show that when accounting for the cumulative effect of simultaneous failures, the probability of data loss estimates can vary by two to four orders of magnitude.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121252472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nawanol Theera-Ampornpunt, Tarun Mangla, S. Bagchi, R. Panta, Kaustubh R. Joshi, M. Ammar, E. Zegura
{"title":"TANGO: Toward a More Reliable Mobile Streaming through Cooperation between Cellular Network and Mobile Devices","authors":"Nawanol Theera-Ampornpunt, Tarun Mangla, S. Bagchi, R. Panta, Kaustubh R. Joshi, M. Ammar, E. Zegura","doi":"10.1109/SRDS.2016.047","DOIUrl":"https://doi.org/10.1109/SRDS.2016.047","url":null,"abstract":"Multimedia streaming is a major mobile application, accounting for more than half of total mobile traffic. Streaming applications usually have a static buffering strategy. For example, buffer size is limited to x minutes of the stream, where x is optimized to provide the best trade-off between minimizing stalls and limiting waste of user's bandwidth and energy resulting from user abandonment. We show that such strategies based on information available on the mobile device alone do not work well when network conditions change dynamically, e.g., connectivity degrades due to congestion. We propose an alternative strategy using the framework called TANGO, based on a novel idea of cooperation between cellular network and mobile devices. By monitoring real-time network conditions and continuously predicting user location, our system is able to predict connectivity degradation in the near term. In such events, a notification is sent to the mobile device so that the streaming application can initiate a mitigation action, such as to pre-cache more content. In simulations based on real user traces, we found that TANGO reduces pause time by 13–72%, significantly outperforming DASH, which is the current state of the art.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125680768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}