{"title":"Why Does the Cloud Stop Computing?: Lessons from Hundreds of Service Outages","authors":"Haryadi S. Gunawi, M. Hao, Riza O. Suminto, Agung Laksono, A. Satria, J. Adityatama, Kurnia J. Eliazar","doi":"10.1145/2987550.2987583","DOIUrl":"https://doi.org/10.1145/2987550.2987583","url":null,"abstract":"We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed 1247 headline news and public post-mortem reports that detail 597 unplanned outages that occurred within a 7-year span from 2009 to 2015. We analyzed outage duration, root causes, impacts, and fix procedures. This study reveals the broader availability landscape of modern cloud services and provides answers to why outages still take place even with pervasive redundancies.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"38 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116153219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TR-Spark: Transient Computing for Big Data Analytics","authors":"Ying Yan, Yanjie Gao, Yang Chen, Zhongxin Guo, Bole Chen, T. Moscibroda","doi":"10.1145/2987550.2987576","DOIUrl":"https://doi.org/10.1145/2987550.2987576","url":null,"abstract":"Large-scale public cloud providers invest billions of dollars into their cloud infrastructure and operate hundreds of thousands of servers across the globe. For various reasons, much of this provisioned server capacity runs at low average utilization, and there is tremendous competitive pressure to increase utilization. Conceptually, the way to increase utilization is clear: Run time-insensitive batch-job workloads as secondary background tasks whenever server capacity is underutilized; and evict these workloads when the server's primary task requires more resources. Big data analytic tasks would seem to be an ideal fit to run opportunistically on such transient resources in the cloud. In reality, however, modern distributed data processing systems such as MapReduce or Spark are designed to run as the primary task on dedicated hardware, and they perform badly on transiently available resources because of the excessive cost of cascading re-computations in case of evictions. In this paper, we propose a new framework for big data analytics on transient resources. Specifically, we design and implement TR-Spark, a version of Spark that can run highly efficiently as a secondary background task on transient (evictable) resources. The design of TR-Spark is based on two principles: resource stability and data size reduction-aware scheduling and lineage-aware checkpointing. The combination of these principles allows TR-Spark to naturally adapt to the stability characteristics of the underlying compute infrastructure. Evaluation results show that while regular Spark effectively fails to finish a job in clusters of even moderate instability, TR-Spark performs nearly as well as Spark running on stable resources.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124707180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optasia: A Relational Platform for Efficient Large-Scale Video Analytics","authors":"Yao Lu, Aakanksha Chowdhery, Srikanth Kandula","doi":"10.1145/2987550.2987564","DOIUrl":"https://doi.org/10.1145/2987550.2987564","url":null,"abstract":"Camera deployments are ubiquitous, but existing methods to analyze video feeds do not scale and are error-prone. We describe Optasia, a dataflow system that employs relational query optimization to efficiently process queries on video feeds from many cameras. Key gains of Optasia result from modularizing vision pipelines in such a manner that relational query optimization can be applied. Specifically, Optasia can (i) de-duplicate the work of common modules, (ii) auto-parallelize the query plans based on the video input size, number of cameras and operation complexity, and (iii) offer chunk-level parallelism that allows multiple tasks to process the feed of a single camera. Evaluation on traffic videos from a large city on complex vision queries shows high accuracy with many-fold improvements in query completion time and resource usage relative to existing systems.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124621314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CQSTR: Securing Cross-Tenant Applications with Cloud Containers","authors":"Yan Zhai, Lichao Yin, J. Chase, T. Ristenpart, M. Swift","doi":"10.1145/2987550.2987558","DOIUrl":"https://doi.org/10.1145/2987550.2987558","url":null,"abstract":"Cloud providers are in a position to greatly improve the trust clients have in network services: IaaS platforms can isolate services so they cannot leak data, and can help verify that they are securely deployed. We describe a new system called CQSTR that allows clients to verify a service's security properties. CQSTR provides a new cloud container abstraction similar to Linux containers but for VM clusters within IaaS clouds. Cloud containers enforce constraints on what software can run, and control where and how much data can be communicated across service boundaries. With CQSTR, IaaS providers can make assertions about the security properties of a service running in the cloud. We investigate implementations of CQSTR on both Amazon AWS and OpenStack. With AWS, we build on virtual private clouds to limit network access and on authorization mechanisms to limit storage access. However, with AWS certain security properties can be checked only by monitoring audit logs for violations after the fact. We modified OpenStack to implement the full CQSTR model with only modest code changes. We show how to use CQSTR to build more secure deployments of the data analytics frameworks PredictionIO, PacketPig, and SpamAssassin. In experiments on CloudLab we found that the performance impact of CQSTR on applications is near zero.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131966492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAMinate: Hypervisor-based Virtualization for Hybrid Main Memory Systems","authors":"Takahiro Hirofuchi, Ryousei Takano","doi":"10.1145/2987550.2987570","DOIUrl":"https://doi.org/10.1145/2987550.2987570","url":null,"abstract":"In the future, STT-MRAM will achieve larger capacity and comparable read/write performance, but incur orders of magnitude greater write energy than DRAM. To achieve large capacity as well as energy-efficiency, it is necessary to use both DRAM and STT-MRAM for the main memory of a computer. In this paper, we propose a hypervisor-based hybrid memory mechanism (RAMinate) that reduces write traffic to STT-MRAM by optimizing page locations between DRAM and STT-MRAM. In contrast to past studies, our mechanism works at the hypervisor level, not at the hardware or operating system level. It does not require any special program at the operating system level nor any design changes of the current memory controller at the hardware level. We developed a prototype of the proposed system by extending Qemu/KVM and conducted experiments with application benchmarks. We confirmed that our page replacement mechanism successfully worked for unmodified operating systems and dynamically diverted memory write traffic to DRAM. Our experiments also confirmed that our system successfully reduced write traffic to STT-MRAM by approximately 70% for tested workloads, which results in a 50% reduction in energy consumption in comparison to a DRAM-only system.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133364966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics","authors":"Benjamin Heintz, A. Chandra, R. Sitaraman","doi":"10.1145/2987550.2987580","DOIUrl":"https://doi.org/10.1145/2987550.2987580","url":null,"abstract":"Many applications must ingest rapid data streams and produce analytics results in near-real-time. It is increasingly common for inputs to such applications to originate from geographically distributed sources. The typical infrastructure for processing such geo-distributed streams follows a hub-and-spoke model, where several edge servers perform partial computation before forwarding results over a wide-area network (WAN) to a central location for final processing. Due to limited WAN bandwidth, it is not always possible to produce exact results. In such cases, applications must either sacrifice timeliness by allowing delayed (i.e., stale) results, or sacrifice accuracy by allowing some error in final results. In this paper, we focus on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and we study the tradeoff between staleness and error. We present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy under bandwidth limitations. Using a workload derived from an analytics service offered by a large commercial CDN, we demonstrate the effectiveness of our techniques through both trace-driven simulation and experiments on an Apache Storm-based implementation deployed on PlanetLab. Our experiments show that our proposed algorithms reduce staleness by 81.8% to 96.6%, and error by 83.4% to 99.1% compared to a practical random sampling/batching-based aggregation algorithm across a diverse set of aggregation functions.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126530238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Addressing the straggler problem for iterative convergent parallel ML","authors":"A. Harlap, Henggang Cui, Wei Dai, Jinliang Wei, G. Ganger, Phillip B. Gibbons, Garth A. Gibson, E. Xing","doi":"10.1145/2987550.2987554","DOIUrl":"https://doi.org/10.1145/2987550.2987554","url":null,"abstract":"FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML). The frequent (e.g., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others. FlexRR combines a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers to address straggler threads. Experiments with real straggler behavior observed on Amazon EC2 and Microsoft Azure, as well as injected straggler behavior stress tests, confirm the significance of the problem and the effectiveness of FlexRR's solution. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all real and injected straggler behaviors tested.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131867217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VNToR: Network Virtualization at the Top-of-Rack Switch","authors":"J. Fietz, S. Whitlock, G. Ioannidis, K. Argyraki, Edouard Bugnion","doi":"10.1145/2987550.2987582","DOIUrl":"https://doi.org/10.1145/2987550.2987582","url":null,"abstract":"Cloud providers typically implement abstractions for network virtualization on the server, within the operating system that hosts the tenant virtual machines or containers. Despite being flexible and convenient, this approach has fundamental problems: incompatibility with bare-metal support, unnecessary performance overhead, and susceptibility to hypervisor breakouts. To solve these, we propose to offload the implementation of network-virtualization abstractions to the top-of-rack switch (ToR). To show that this is feasible and beneficial, we present VNToR, a ToR that takes over the implementation of the security-group abstraction. Our prototype combines commodity switching hardware with a custom software stack and is integrated in OpenStack Neutron. We show that VNToR can store tens of thousands of access rules, adapts to traffic-pattern changes in less than a millisecond, and significantly outperforms the state of the art.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130332505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Caching in Big SQL using the HDFS Cache","authors":"A. Floratou, N. Megiddo, Navneet Potti, Fatma Özcan, Uday Kale, Jan Schmitz-Hermes","doi":"10.1145/2987550.2987553","DOIUrl":"https://doi.org/10.1145/2987550.2987553","url":null,"abstract":"The memory and storage hierarchy in database systems is currently undergoing a radical evolution in the context of Big Data systems. SQL-on-Hadoop systems share data with other applications in the Big Data ecosystem by storing their data in HDFS, using open file formats. However, they do not provide automatic caching mechanisms for storing data in memory. In this paper, we describe the architecture of IBM Big SQL and its use of the HDFS cache as an alternative to the traditional buffer pool, allowing in-memory data to be shared with other Big Data applications. We design novel adaptive caching algorithms for Big SQL tailored to the challenges of such an external cache scenario. Our experimental evaluation shows that only our adaptive algorithms perform well for diverse workload characteristics, and are able to adapt to evolving data access patterns. Finally, we discuss our experiences in addressing the new challenges imposed by external caching and summarize our insights about how to direct ongoing architectural evolution of external caching mechanisms.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121653685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Freeze-Frame File System","authors":"Weijia Song, Theo Gkountouvas, K. Birman, Qi Chen, Zhen Xiao","doi":"10.1145/2987550.2987578","DOIUrl":"https://doi.org/10.1145/2987550.2987578","url":null,"abstract":"Many applications perform real-time analysis on data streams. We argue that existing solutions are poorly matched to the need, and introduce our new Freeze-Frame File System. Freeze-Frame FS is able to accept streams of updates while satisfying \"temporal reads\" on demand. The system is fast and accurate: we keep all update history in a memory-mapped log, cache recently retrieved data for repeat reads, and use a hybrid of a real-time and a logical clock to respond to read requests in a manner that is both temporally precise and causally consistent. When RDMA hardware is available, the throughput of a single client reaches 2.6GB/s for writes and 5GB/s for reads, close to the limit (about 6GB/s) on the RDMA hardware used in our experiments. Even without RDMA, Freeze-Frame FS substantially outperforms existing options for our target settings.","PeriodicalId":362207,"journal":{"name":"Proceedings of the Seventh ACM Symposium on Cloud Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130522819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}