Daniel W. Sun, Daniel Guimarans, A. Fekete, V. Gramoli, Liming Zhu
{"title":"Multi-objective Optimisation of Rolling Upgrade Allowing for Failures in Clouds","authors":"Daniel W. Sun, Daniel Guimarans, A. Fekete, V. Gramoli, Liming Zhu","doi":"10.1109/SRDS.2015.37","DOIUrl":"https://doi.org/10.1109/SRDS.2015.37","url":null,"abstract":"Rolling upgrade is a practical industry technique for online updating of software in distributed systems. This paper focuses on rolling upgrade of software versions in virtual machine instances on cloud computing platforms, when various failures may occur. An operator can choose the number of instances that are updated in one round and system environments to minimise completion time, availability degradation, and monetary cost for the entire rolling upgrade, and hence this is a multi-objective optimisation problem. To predict completion time in the presence of failures, we offer a stochastic model that represents the dynamics of rolling upgrade. To reduce the computational effort of decision making for large scale complex systems, we propose a technique that can find a Pareto set quickly via an upper bound of the expected completion time. Then an optimum of the original problem can be chosen from this set of potential solutions. We validate our approach to minimise the objectives, through both experiments in Amazon Web Services (AWS) and simulations.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132045964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qing Liu, D. Feng, Hong Jiang, Yuchong Hu, Tianfeng Jiao
{"title":"Z Codes: General Systematic Erasure Codes with Optimal Repair Bandwidth and Storage for Distributed Storage Systems","authors":"Qing Liu, D. Feng, Hong Jiang, Yuchong Hu, Tianfeng Jiao","doi":"10.1109/SRDS.2015.18","DOIUrl":"https://doi.org/10.1109/SRDS.2015.18","url":null,"abstract":"Erasure codes are widely used in distributed storage systems to prevent data loss. Traditional erasure codes suffer from a typical repair-bandwidth problem in which the amount of data required to reconstruct the lost data, referred to as the repair bandwidth, is often far more than the theoretical minimum. While many novel erasure codes have been proposed in recent years to reduce the repair bandwidth, these codes either require extra storage capacity and computation overhead or are only applicable to some special cases. To address the weaknesses of the existing solutions to the repair-bandwidth problem, we propose Z Codes, a general family of codes capable of achieving the theoretical lower bound of repair bandwidth for a single data node failure. To the best of our knowledge, the Z codes are the first general systematic erasure codes that achieve optimal repair bandwidth under the minimum storage. Our in-memory performance evaluations of a 1-GB file indicate that Z codes have encoding and repairing speeds that are approximately equal to those of the Reed-Solomon (RS) codes, and their speed on the order of GB/s practically removes computation as a performance bottleneck.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116617015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replacement: Decentralized Failure Handling for Replicated State Machines","authors":"Leander Jehl, T. E. Lea, H. Meling","doi":"10.1109/SRDS.2015.29","DOIUrl":"https://doi.org/10.1109/SRDS.2015.29","url":null,"abstract":"We investigate methods for handling failures in a Paxos State Machine and introduce Replacement, a novel approach to handle failures. Replacement is fully decentralized and does not rely on consensus. This allows failed replicas to be replaced quickly, avoiding the bottleneck of a single leader. Instead of handling failures in the order proposed by a leader, concurrent replacements are combined to guarantee that all failed replicas are replaced. Replacement also allows the state machine to process client requests during failure handling, even while disagreeing on the current configuration. As our evaluation shows, this enables Replacement to quickly handle failures, with minimal disruption in the processing of client requests.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122487364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guido Lena Cota, Sonia Ben Mokhtar, J. Lawall, Gilles Muller, G. Gianini, E. Damiani, L. Brunie
{"title":"A Framework for the Design Configuration of Accountable Selfish-Resilient Peer-to-Peer Systems","authors":"Guido Lena Cota, Sonia Ben Mokhtar, J. Lawall, Gilles Muller, G. Gianini, E. Damiani, L. Brunie","doi":"10.1109/SRDS.2015.36","DOIUrl":"https://doi.org/10.1109/SRDS.2015.36","url":null,"abstract":"A challenge in designing a peer-to-peer (P2P) system is to ensure that the system is able to tolerate selfish nodes that strategically deviate from their specification whenever doing so is convenient. In this paper, we propose RACOON, a framework for the design of P2P systems that are resilient to selfish behaviours. While most existing solutions target specific systems or types of selfishness, RACOON proposes a generic and semi-automatic approach that achieves robust and reusable results. Also, RACOON supports the system designer in the performance-oriented tuning of the system, by proposing a novel approach that combines Game Theory and simulations. We illustrate the benefits of using RACOON by designing two P2P systems: a live streaming and an anonymous communication system. In simulations and a real deployment of the two applications on a testbed comprising 100 nodes, the systems designed using RACOON achieve both resilience to selfish nodes and high performance.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130274597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniela Cason, Parisa Jalili Marandi, L. E. Buzato, F. Pedone
{"title":"Chasing the Tail of Atomic Broadcast Protocols","authors":"Daniela Cason, Parisa Jalili Marandi, L. E. Buzato, F. Pedone","doi":"10.1109/SRDS.2015.28","DOIUrl":"https://doi.org/10.1109/SRDS.2015.28","url":null,"abstract":"Many applications today rely on multiple services, whose results are combined to form the application's response. In such contexts, the most unreliable service and the slowest service determine the application's reliability and response time, respectively. State-machine replication and atomic broadcast are fundamental abstractions to build highly available services. In this paper, we consider the latency variability of atomic broadcast protocols. This is important because atomic broadcast has a direct impact on the response time of services. We study four high performance atomic broadcast protocols representative of different classes of protocol design and characterize their latency tail distribution under different workloads. Next, we assess how key design features of each protocol can possibly be related to the observed latency tail distributions. Our observations hint at request batching as a simple yet effective way to shorten the latency tails of some of the studied protocols, an improvement within the reach of application implementers. Indeed, our observation is not only verified experimentally, it allows us to assess which of the protocol's key design principles favor the construction of latency predictable protocols.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132910274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PmDroid: Permission Supervision for Android Advertising","authors":"Xing Gao, Dachuan Liu, Haining Wang, Kun Sun","doi":"10.1109/SRDS.2015.41","DOIUrl":"https://doi.org/10.1109/SRDS.2015.41","url":null,"abstract":"It is well-known that Android mobile advertising networks may abuse their host applications' permission to collect private information. Since the advertising library and host app are running in the same process, the current Android permission mechanism cannot prevent an ad network from collecting private data that is out of an ad network's permission range. In this paper, we propose PmDroid to protect the data that is not under the scope of the ad network's permission set. PmDroid can block the data from being sent to advertising servers at the occurrence of permission violation in ad networks. Moreover, we utilize PmDroid to assess how serious the permission violation problem is in the ad networks. We first implement 53 sample apps using a single ad network library. We grant all permissions of Android 4.3 to these apps and record the data sent to the Internet. Then, we further analyze 430 published market apps. In total, there are 76 ad networks identified in our experiments. We compare the permission of data received by these ad networks with their official documents. Our experimental results indicate that the permission violation is a real problem in existing ad network markets.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114768527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Signature-Based Top-k Query Processing against Data Replacement Attacks in MANETs","authors":"Takuji Tsuda, Yuka Komai, T. Hara, S. Nishio","doi":"10.1109/SRDS.2015.34","DOIUrl":"https://doi.org/10.1109/SRDS.2015.34","url":null,"abstract":"In this paper, we propose a signature-based top-k query processing method against data replacement attacks in mobile ad hoc networks (MANETs). In order to rapidly identify a greater number of malicious nodes, nodes share information about identified malicious nodes with other nodes. If nodes share only this information, however, malicious nodes may successfully transmit false information identifying normal nodes as malicious. Therefore, in the proposed method, when nodes send reply messages during query processing, they attach encrypted information about the sent data items (i.e., digital signatures), providing the query-issuing node with critical information about the data items sent by nodes in the network, and thereby enabling it to identify malicious nodes, using the received signatures. After identifying the malicious nodes, it floods the network with a notification message including the signatures in which the identified malicious nodes have replaced higher-score data items with their own lower-score items.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125136196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Dzung, R. Guerraoui, David Kozhaya, Y. Pignolet
{"title":"To Transmit Now or Not to Transmit Now","authors":"D. Dzung, R. Guerraoui, David Kozhaya, Y. Pignolet","doi":"10.1109/SRDS.2015.26","DOIUrl":"https://doi.org/10.1109/SRDS.2015.26","url":null,"abstract":"Given an unreliable communication link, this paper studies how to build, in an energy-efficient manner, a reliable communication service that is synchronous with high probability. We consider a Partially Observable Markov Decision Process (POMDP) setting in which a communication link's transmission quality: (i) changes according to a classic Markovian model and (ii) can be only partially observed, through feedback relative to previous transmissions. We perform a thorough analysis under several variations of Ack/Nack feedback mechanisms. Despite the general intractability of POMDPs, we prove that our communication service, under reliable feedback, can be inexpensively implemented. We obtain closed form solutions specifying when to transmit over the link, which allows deriving an energy-optimal implementation. We also analyse the impact of lossy feedback on implementing our communication service. Considering multiple lossy feedback mechanisms, we show that an easily implementable structure for our communication service can also be obtained, depending on the feedback mechanism itself.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124343491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shiyi Li, Q. Cao, Lei Tian, Shenggang Wan, Lu Qian, C. Xie
{"title":"PSG-Codes: An Erasure Codes Family with High Fault Tolerance and Fast Recovery","authors":"Shiyi Li, Q. Cao, Lei Tian, Shenggang Wan, Lu Qian, C. Xie","doi":"10.1109/SRDS.2015.39","DOIUrl":"https://doi.org/10.1109/SRDS.2015.39","url":null,"abstract":"As hard disk failure rates are rarely improved and the reconstruction time for TB-level disks typically amounts to days, multiple concurrent disk/storage node failures in datacenter storage systems become common and frequent. As a result, the erasure coding schemes used in datacenters must meet the critical requirements of high fault tolerance, high storage efficiency, and fast fault recovery. In this paper, we introduce a new XOR-based non-MDS erasure code family, called PSG-Codes, that can tolerate up to 12 disk/node failures. The basic idea behind PSG-Codes is to partition disks into groups, and exploit short parity chains to generate parity units. Then, the parity chain is further shortened by varying the number of parity elements for each strip. We conduct a simulation-based study to search the configuration parameter space of PSG-Codes, and prove that PSG-Codes can tolerate up to 12 disk/node failures. Compared with a well-known XOR-based non-MDS code, WEAVER codes, PSG-Codes have higher storage efficiency and lower reconstruction cost. Moreover, the storage efficiency and performance of PSG-Codes are also competitive with another state-of-the-art family of GF-based non-MDS codes, LRC codes.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127036624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reza Hajisheykhi, Mohammad Roohitavaf, S. Kulkarni
{"title":"Auditable Restoration of Distributed Programs","authors":"Reza Hajisheykhi, Mohammad Roohitavaf, S. Kulkarni","doi":"10.1109/SRDS.2015.24","DOIUrl":"https://doi.org/10.1109/SRDS.2015.24","url":null,"abstract":"We focus on a protocol for auditable restoration of distributed systems. The need for such a protocol arises due to conflicting requirements (e.g., access to the system should be restricted but emergency access should be provided). One can design such systems with a tamper detection approach (based on the intuition of \"break the glass door\"). However, in a distributed system, such tampering events, which are denoted as auditable events, are visible only to a single node. This is unacceptable since the actions they take in these situations can be different than those in the normal mode. Moreover, eventually, the auditable event needs to be cleared so that the system resumes normal operation. With this motivation, in this paper, we present a protocol for auditable restoration, where any process can potentially identify an auditable event. Whenever a new auditable event occurs, the system must reach an \"auditable state\" where every process is aware of the auditable event. Only after the system reaches an auditable state can it begin the operation of restoration. Although any process can observe an auditable event, we require that only \"authorized\" processes can begin the task of restoration. Moreover, these processes can begin the restoration only when the system is in an auditable state. Our protocol is self-stabilizing and can effectively handle the case where faults or auditable events occur during the restoration protocol. Moreover, it can be used to provide auditable restoration to other distributed protocols.","PeriodicalId":244925,"journal":{"name":"2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121596107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}