Nentawe Gurumdimma, A. Jhumka, Maria Liakata, Edward Chuah, J. Browne
{"title":"CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems","authors":"Nentawe Gurumdimma, A. Jhumka, Maria Liakata, Edward Chuah, J. Browne","doi":"10.1109/SRDS.2016.017","DOIUrl":"https://doi.org/10.1109/SRDS.2016.017","url":null,"abstract":"The use of console logs for error detection in large scale distributed systems has proven to be useful to system administrators. However, such logs are typically redundant and incomplete, making accurate detection very difficult. In an attempt to increase this accuracy, we complement these incomplete console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that makes use of both the resource usage data and console logs. We thus make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, (iii) an algorithm that links jobs with anomalous resource usage with erroneous nodes. We then evaluate our approach using console logs and resource usage data from the Ranger Supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) when compared with the well-known Nodeinfo error detection algorithm, our algorithm provides an average improvement of around 85% over Nodeinfo, with a best-case improvement of 250%.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128384497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visualizing and Controlling VMI-Based Malware Analysis in IaaS Cloud","authors":"Noëlle Rakotondravony, Hans P. Reiser","doi":"10.1109/SRDS.2016.035","DOIUrl":"https://doi.org/10.1109/SRDS.2016.035","url":null,"abstract":"Security in virtualized environment has known the support of different tools in the low-level detection and analysis of malware. The in-guest tracing mechanisms are now capable of operating at assembly language-, system call-, function call-and instruction-level to detect and classify malicious activities. Therefore, they are producing large amount of data about the state of a target system. However, the integrity of such data becomes questionable whenever the hosting target system is compromised. With virtual machine introspection (VMI), the monitoring tool runs outside the target monitored virtual machine (VM) [1]. Thus, the integrity of retrieved data is ensured even if the target system is compromised. Various works have brought VMI to Infrastructure-as-a-Service (Iaas) cloud environment, allowing the cloud user to run (simultaneous) forensics operations on his production VMs. The associated tracing mechanisms can collect larger amount of data in form of commented behavior traces or unstandardized log records. Thus, a human operator is needed to efficiently parse, represent, visualize and interpret the collected data, to benefit from their security relevance [2]. The use of visualization helps analysts investigate, compare and culster malware samples [3]. Existing visualization tools make use of recorded information to enhance the detection of intrusive behavior or the clustering of malware [4] from the observed system. However, at our knowledge, no existing tools establish a pre-to post-exploitation visualization graphs. We present an approach that enhances the detection and analysis of malware in the cloud by providing the cloud end-users the mean to efficiently visualize the different security relevant data collected through multiple VMI-based mechanisms.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134119233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Including Security Monitoring in Cloud Service Level Agreements","authors":"Amir Teshome, Louis Rilling, C. Morin","doi":"10.1109/SRDS.2016.034","DOIUrl":"https://doi.org/10.1109/SRDS.2016.034","url":null,"abstract":"Service providers give assurance on some aspects of the service but, as of today, security monitoring is not one of them. In our work, we aim to allow providers to provide customers with guarantees on security monitoring of their outsourced information system.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116184478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lateral Movement Detection Using Distributed Data Fusion","authors":"Ahmed M. Fawaz, Atul Bohara, C. Cheh, W. Sanders","doi":"10.1109/SRDS.2016.014","DOIUrl":"https://doi.org/10.1109/SRDS.2016.014","url":null,"abstract":"Attackers often attempt to move laterally from host to host, infecting them until an overall goal is achieved. One possible defense against this strategy is to detect such coordinated and sequential actions by fusing data from multiple sources. In this paper, we propose a framework for distributed data fusion that specifies the communication architecture and data transformation functions. Then, we use this framework to specify an approach for lateral movement detection that uses host-level process communication graphs to infer network connection causations. The connection causations are then aggregated into system-wide host-communication graphs that expose possible lateral movement in the system. In order to provide a balance between the resource usage and the robustness of the fusion architecture, we propose a multilevel fusion hierarchy that uses different clustering techniques. We evaluate the scalability of the hierarchical fusion scheme in terms of storage overhead, number of message updates sent, fairness of resource sharing among clusters, and quality of local graphs. Finally, we implement a host-level monitor prototype to collect connection causations, and evaluate its overhead. The results show that our approach provides an effective method to detect lateral movement between hosts, and can be implemented with acceptable overhead.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116240212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Norton Zone: Symantec's Secure Cloud Storage System","authors":"Walter Bogorad, Scott Schneider, Haibin Zhang","doi":"10.1109/SRDS.2016.020","DOIUrl":"https://doi.org/10.1109/SRDS.2016.020","url":null,"abstract":"Cloud storage services are the way of the future, if not the present, but broad adoption is limited by a stark trade-off between privacy and functionality. Many popular cloud services provide search capabilities, but make only nominal efforts to keep user data fully private. Alternatives that search private user data on an untrusted server sacrifice functionality and/or scalability. We describe Norton Zone, Symantec's secure and scalable public storage system based on our valet security model. Whereas most commercial cloud storage systems secure user data with access control and legal mechanisms, Zone's cryptographic techniques provide proven privacy guarantees. This gives users an extra layer of security without compromising functionality. Zone's performance is comparable to unencrypted cloud storage systems that support search and sharing. We report on the design of Zone and the lessons learned in developing and deploying it in commercial, distributed datacenters scalable to millions of users.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129889579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inside-Out: Reliable Performance Prediction for Distributed Storage Systems in the Cloud","authors":"Chin-Jung Hsu, R. Panta, Moo-Ryong Ra, V. Freeh","doi":"10.1109/SRDS.2016.025","DOIUrl":"https://doi.org/10.1109/SRDS.2016.025","url":null,"abstract":"Many storage systems are undergoing a significant shift from dedicated appliance-based model to software-defined storage (SDS) because the latter is flexible, scalable and cost-effective for modern workloads. However, it is challenging to provide a reliable guarantee of end-to-end performance in SDS due to complex software stack, time-varying workload and performance interference among tenants. Therefore, modeling and monitoring the performance of storage systems is critical for ensuring reliable QoS guarantees. Existing approaches such as performance benchmarking and analytical modeling are inadequate because they are not efficient in exploring large configuration space, and cannot support elastic operations and diverse storage services in SDS. This paper presents Inside-Out, an automatic model building tool that creates accurate performance models for distributed storage services. Inside-Out is a black-box approach. It builds high-level performance models by applying machine learning techniques to low-level system performance metrics collected from individual components of the distributed SDS system. Inside-Out uses a two-level learning method that combines two machine learning models to automatically filter irrelevant features, boost prediction accuracy and yield consistent prediction. Our in-depth evaluation shows that Inside-Out is a robust solution that enables SDS to predict end-to-end performance even in challenging conditions, e.g., changes in workload, storage configuration, available cloud resources, size of the distributed storage service, and amount of interference due to multi-tenants. Our experiments show that Inside-Out can predict end-to-end performance with 91.1% accuracy on average. Its prediction accuracy is consistent across diverse storage environments.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129099510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VMBeam: Zero-Copy Migration of Virtual Machines for Virtual IaaS Clouds","authors":"Kenichi Kourai, H. Ooba","doi":"10.1109/SRDS.2016.024","DOIUrl":"https://doi.org/10.1109/SRDS.2016.024","url":null,"abstract":"Virtual Infrastructure-as-a-Service (IaaS) clouds are emerging for secondary cloud service providers to manage their own IaaS clouds on top of existing IaaS clouds. In virtual IaaS clouds, guest virtual machines (VMs) run inside cloud VMs provided by existing IaaS clouds. Unlike traditional IaaS clouds, they can be migrated between cloud VMs co-located at the same host. However, the performance of such VM migration is low due to slow virtual networks and doubled system loads. To optimize VM migration between co-located cloud VMs, we propose zero-copy migration for virtual IaaS clouds. Zero-copy migration just relocates the memory image of a guest VM without any copy. To enable live migration with negligible downtime, it first makes the memory of a guest VM share with the destination cloud VM and thereafter completes memory relocation. We have implemented a system called VMBeam for enabling zero-copy migration in Xen. According to our experimental results, zero-copy migration could achieve high migration performance and low system loads.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130142959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tale of Tails: Anomaly Avoidance in Data Centers","authors":"Ji Xue, R. Birke, L. Chen, E. Smirni","doi":"10.1109/SRDS.2016.021","DOIUrl":"https://doi.org/10.1109/SRDS.2016.021","url":null,"abstract":"It is a common practice that today's cloud data centers guard the performance by monitoring the resource usage, e.g., CPU and RAM, and issuing anomaly tickets whenever detecting usages exceeding predefined target values. Ensuring free of such usage anomaly can be extremely challenging, while catering to a large amount of virtual machines (VMs) showing bursty workloads on a limited amount of physical resource. Using resource usage data from production data centers that consist of more than 6K physical machines hosting more than 80K VMs, we identify statistic properties of anomaly instances (AIs) on physical servers, highlighting their burst duration and potential root causes. To strike a tradeoff between a strong performance guarantee and resource provisions, we propose a tail-driven anomaly avoidance policy for boxes, TailGuard, which allows a small fraction of AIs, e.g., 5% of usages can be above the target value, and still avoid severe performance degradation, typically caused by a burst of continuous AI. Specifically, TailGuard first introduces a novel usage tail prediction that explores the similarity patterns across a great number of boxes within a very recent history, and then redistributes the server load in an online fashion by proactive VM cloning and reactive load balancing. Evaluation results show that TailGuard can not only achieve an accuracy comparable with prediction methodology that relies on long history of usage data but also dramatically reduce the number of CPU AIs by 60%, with a tenfold reduction of their duration, from more than 25 time windows to only 2.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129191842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dorian Burihabwa, Rogério Pontes, P. Felber, Francisco Maia, H. Mercier, R. Oliveira, J. Paulo, V. Schiavoni
{"title":"On the Cost of Safe Storage for Public Clouds: An Experimental Evaluation","authors":"Dorian Burihabwa, Rogério Pontes, P. Felber, Francisco Maia, H. Mercier, R. Oliveira, J. Paulo, V. Schiavoni","doi":"10.1109/SRDS.2016.028","DOIUrl":"https://doi.org/10.1109/SRDS.2016.028","url":null,"abstract":"Cloud-based storage services such as Dropbox, Google Drive and OneDrive are increasingly popular for storing enterprise data, and they have already become the de facto choice for cloud-based backup of hundreds of millions of regular users. Drawn by the wide range of services they provide, no upfront costs and 24/7 availability across all personal devices, customers are well-aware of the benefits that these solutions can bring. However, most users tend to forget—or worse ignore—some of the main drawbacks of such cloud-based services, namely in terms of privacy. Data entrusted to these providers can be leaked by hackers, disclosed upon request from a governmental agency's subpoena, or even accessed directly by the storage providers (e.g., for commercial benefits). While there exist solutions to prevent or alleviate these problems, they typically require direct intervention from the clients, like encrypting their data before storing it, and reduce the benefits provided such as easily sharing data between users. This practical experience report studies a wide range of security mechanisms that can be used atop standard cloud-based storage services. We present the details of our evaluation testbed and discuss the design choices that have driven its implementation. We evaluate several state-of-the-art techniques with varying security guarantees responding to user-assigned security and privacy criteria. Our results reveal the various trade-offs of the different techniques by means of representative workloads on top of industry-grade storage services.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116188631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaborative Stabilization","authors":"Mohammad Roohitavaf, S. Kulkarni","doi":"10.1109/SRDS.2016.043","DOIUrl":"https://doi.org/10.1109/SRDS.2016.043","url":null,"abstract":"In this paper, we present the paradigm of collaborative stabilization that focuses on providing stabilization in the presence of an essential but potentially disruptive environment. By essential, we mean that without the environment actions, stabilization property would be impossible. At the same time, environment actions are not in the control of the program and can be disruptive to the recovery. We demonstrate the need for collaborative stabilization by providing examples where existing paradigms of stabilization are undesirable/insufficient. We compare collaborative stabilization with existing paradigms of stabilization. We identify the complexity of verifying collaborative stabilizing programs and develop theorems that focus on composition of such programs.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123441204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}