Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)最新文献_第7页

Pigeon 鸽子

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3362728

Zhijun Wang, Huiyang Li, Zhongwei Li, Xiaocui Sun, J. Rao, Hao Che, Hong Jiang

{"title":"Pigeon","authors":"Zhijun Wang, Huiyang Li, Zhongwei Li, Xiaocui Sun, J. Rao, Hao Che, Hong Jiang","doi":"10.1145/3357223.3362728","DOIUrl":"https://doi.org/10.1145/3357223.3362728","url":null,"abstract":"In today's datacenters, job heterogeneity makes it difficult for schedulers to simultaneously meet latency requirements and maintain high resource utilization. The state-of-the-art datacenter schedulers, including centralized, distributed, and hybrid schedulers, fail to ensure low latency for short jobs in large-scale and highly loaded systems. The key issues are the scalability in centralized schedulers, ineffective and inefficient probing and resource sharing in both distributed and hybrid schedulers. In this paper, we propose Pigeon, a distributed, hierarchical job scheduler based on a two-layer design. Pigeon divides workers into groups, each managed by a separate master. In Pigeon, upon a job arrival, a distributed scheduler directly distribute tasks evenly among masters with minimum job processing overhead, hence, preserving highest possible scalability. Meanwhile, each master manages and distributes all the received tasks centrally, oblivious of the job context, allowing for full sharing of the worker pool at the group level to maximize multiplexing gain. To minimize the chance of head-of-line blocking for short jobs and avoid starvation for long jobs, two weighted fair queues are employed in each master to accommodate tasks from short and long jobs, separately, and a small portion of the workers are reserved for short jobs. Evaluation via theoretical analysis, trace-driven simulations, and a prototype implementation shows that Pigeon significantly outperforms Sparrow, a representative distributed scheduler, and Eagle, a hybrid scheduler.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"243 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76640720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Coupling Decentralized Key-Value Stores with Erasure Coding 耦合去中心化键值存储与Erasure编码

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3362713

Liangfeng Cheng, Yuchong Hu, P. Lee

{"title":"Coupling Decentralized Key-Value Stores with Erasure Coding","authors":"Liangfeng Cheng, Yuchong Hu, P. Lee","doi":"10.1145/3357223.3362713","DOIUrl":"https://doi.org/10.1145/3357223.3362713","url":null,"abstract":"Modern decentralized key-value stores often replicate and distribute data via consistent hashing for availability and scalability. Compared to replication, erasure coding is a promising redundancy approach that provides availability guarantees at much lower cost. However, when combined with consistent hashing, erasure coding incurs a lot of parity updates during scaling (i.e., adding or removing nodes) and cannot efficiently handle degraded reads caused by scaling. In this paper, we propose a novel erasure coding model called FragEC, which incurs no parity updates during scaling. We further extend consistent hashing with multiple hash rings to enable erasure coding to seamlessly address degraded reads during scaling. We realize our design as an in-memory key-value store called ECHash, and conduct testbed experiments on different scaling workloads in both local and cloud environments. We show that ECHash achieves better scaling performance (in terms of scaling throughput and degraded read latency during scaling) over the baseline erasure coding implementation, while maintaining high basic I/O and node repair performance.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"120 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80216873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

RapidCDC

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3362731

Fan Ni, Song Jiang

{"title":"RapidCDC","authors":"Fan Ni, Song Jiang","doi":"10.1145/3357223.3362731","DOIUrl":"https://doi.org/10.1145/3357223.3362731","url":null,"abstract":"I/O deduplication is a key technique for improving storage systems' space and I/O efficiency. Among various deduplication techniques content-defined chunking (CDC) based deduplication is the most desired one for its high deduplication ratio. However, CDC is compute-intensive and time-consuming, and has been recognized as a major performance bottleneck of the CDC-based deduplication system. In this paper we leverage the existence of a property in the duplicate data, named duplicate locality, that reveals the fact that multiple duplicate chunks are likely to occur together. In other words, one duplicate chunk is likely to be immediately followed by a sequence of contiguous duplicate chunks. The longer the sequence, the stronger the locality is. After a quantitative analysis of duplicate locality in real-world data, we propose a suite of chunking techniques that exploit the locality to remove almost all chunking cost for deduplicatable chunks in CDC-based deduplication systems. The resulting deduplication method, named RapidCDC, has two salient features. One is that its efficiency is positively correlated to the deduplication ratio. RapidCDC can be as fast as a fixed-size chunking method when applied on data sets with high data redundancy. The other feature is that its high efficiency does not rely on high duplicate locality strength. These attractive features make RapidCDC's effectiveness almost guaranteed for datasets with high deduplication ratio. Our experimental results with synthetic and real-world datasets show that RapidCDC's chunking speedup can be up to 33x higher than regular CDC. Meanwhile, it maintains (nearly) the same deduplication ratio.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88457653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

A System-Wide Debugging Assistant Powered by Natural Language Processing 一个由自然语言处理驱动的系统范围调试助手

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3362701

Pradeep Dogga, Karthik Narasimhan, Anirudh Sivaraman, R. Netravali

引用次数: 8

BurScale

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3362706

A. F. Baarzi, T. Zhu, B. Urgaonkar

{"title":"BurScale","authors":"A. F. Baarzi, T. Zhu, B. Urgaonkar","doi":"10.1145/3357223.3362706","DOIUrl":"https://doi.org/10.1145/3357223.3362706","url":null,"abstract":"Cloud providers have recently introduced burstable instances - virtual machines whose CPU capacity is rate limited by token-bucket mechanisms. A user of a burstable instance is able to burst to a much higher resource capacity (\"peak rate\") than the instance's long-term average capacity (\"sustained rate\"), provided the bursts are short and infrequent. A burstable instance tends to be much cheaper than a conventional instance that is always provisioned for the peak rate. Consequently, cloud providers advertise burstable instances as cost-effective options for customers with intermittent needs and small (e.g., single VM) clusters. By contrast, this paper presents two novel usage scenarios for burstable instances in larger clusters with sustained usage. We demonstrate (i) how burstable instances can be utilized alongside conventional instances to handle the transient queueing arising from variability in traffic, and (ii) how burstable instances can mask the VM startup/warmup time when autoscaling to handle flash crowds. We implement our ideas in a system called BurScale and use it to demonstrate cost-effective autoscaling for two important workloads: (i) a stateless web server cluster, and (ii) a stateful Memcached caching cluster. Results from our prototype system show that via its careful combination of burstable and regular instances, BurScale can ensure similar application performance as traditional autoscaling systems that use all regular instances while reducing cost by up to 50%.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74396939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 26

WormSpace WormSpace

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3362739

Ji-Yong Shin, Jieung Kim, Wolf Honoré, Hernán Vanzetto, S. Radhakrishnan, Mahesh Balakrishnan, Zhong Shao

{"title":"WormSpace","authors":"Ji-Yong Shin, Jieung Kim, Wolf Honoré, Hernán Vanzetto, S. Radhakrishnan, Mahesh Balakrishnan, Zhong Shao","doi":"10.1145/3357223.3362739","DOIUrl":"https://doi.org/10.1145/3357223.3362739","url":null,"abstract":"We propose the Write-Once Register (WOR) as an abstraction for building and verifying distributed systems. A WOR exposes a simple, data-centric API: clients can capture, write, and read it. Applications can use a sequence or a set of WORs to obtain properties such as durability, concurrency control, and failure atomicity. By hiding the logic for distributed coordination underneath a data-centric API, the WOR abstraction enables easy, incremental, and extensible implementation and verification of applications built above it. We present the design, implementation, and verification of a system called WormSpace that provides developers with an address space of WORs, implementing each WOR via a Paxos instance. We describe three applications built over WormSpace: a flexible, efficient Multi-Paxos implementation; a shared log implementation with lower append latency than the state-of-the-art; and a fault-tolerant transaction coordinator that uses an optimal number of round-trips. We show that these applications are simple, easy to verify, and match the performance of unverified monolithic implementations. We use a modular layered verification approach to link the proofs for WormSpace, its applications, and a verified operating system to produce the first verified distributed system stack from the application to the operating system.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"110 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74381048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

SpIitServe

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3366027

Aman Jain, A. F. Baarzi, Nader Alfares, G. Kesidis, B. Urgaonkar, M. Kandemir

{"title":"SpIitServe","authors":"Aman Jain, A. F. Baarzi, Nader Alfares, G. Kesidis, B. Urgaonkar, M. Kandemir","doi":"10.1145/3357223.3366027","DOIUrl":"https://doi.org/10.1145/3357223.3366027","url":null,"abstract":"Amazon Web Services (AWS) Lambdas and other \"cloud functions\" (CFs) offer much lower startup latencies than virtual machines (VMs) (tens/hundreds of milliseconds vs. a few/several minutes) with lower minimum cost. This makes it appealing to use them for handling unexpected spikes in simple, stateless workloads [2, 3, 5]. If the spike persists, additional VMs may be launched and CFs can be decommissioned when the VMs are ready (VMs are cheaper per unit resource procured than CFs). However, it is not immediately clear if using CFs for complex workloads - those involving significant state exchange among components - is similarly effective. Current CFs have several restrictions that may limit their efficacy: (i) relatively limited resource capacity, especially main memory (e.g., an AWS Lambda may only have up to 3GB memory), (ii) limited lifetime (e.g., Lambdas are terminated after 15 minutes), and (iii) limited support for sharing of intermediate state (e.g., Lambdas must employ an external storage system such as AWS S3). Contrary to conventional wisdom, we show that it is possible to exploit the faster startup times of CFs to improve cost and performance of autoscaling even for complex workloads. Approach: We design SplitServe [1], implemented as an enhancement of Apache Spark [4], that is capable of simultaneously using AWS VMs and Lambdas for serving the tasks comprising a parallel Spark job. The most salient challenges addressed and design choices made in our efforts are: (i) State exchange: Instead of relying on a slower external cloud storage to transfer state, we leverage the resources associated with the procured VMs and employ HDFS for state exchange. We find that this allows both VMs and Lambdas to achieve throughputs close to that of local disks. Since we are using already provisioned disk capacity, we do not pay extra (as we would if we were to use, say, AWS S3). (ii) Segueing from Lambdas to newly available VMs: Simply killing ongoing tasks on Lambdas and rerunning them on newly available VMs triggers Spark's high overhead fault tolerance mechanisms. So, a diaphanous scheduling decision, based on the amount of time a Lambda function has been running, is made at per task granularity. Briefly, as the time since a Lambda was launched approaches the common-case startup delay for a VM, new tasks are not sent to the Lambda. Findings: In our experiments, we find that SplitServe reduces overall job execution time compared to the state of the art with either a homogeneous or heterogeneous execution environment, i.e., either all VMs or all Lambdas, or simultaneously involving both VMs and Lambdas to execute a job's tasks. For the heterogeneous case, our experimental evaluation of SplitServe using four different workloads (interactive TCP-DS, K-means clustering, PageRank, and Pi) shows that SplitServe-Spark improves performance up to 55% for workloads with small to modest amount of shuffling, and up to 31% in workloads with large amounts of shuffl","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79925195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Linear Quadratic Regulator for Resource-Efficient Cloud Services 资源高效云服务的线性二次型调节器

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3366028

Youngsuk Park, K. Mahadik, Ryan A. Rossi, Gang Wu, Handong Zhao

{"title":"Linear Quadratic Regulator for Resource-Efficient Cloud Services","authors":"Youngsuk Park, K. Mahadik, Ryan A. Rossi, Gang Wu, Handong Zhao","doi":"10.1145/3357223.3366028","DOIUrl":"https://doi.org/10.1145/3357223.3366028","url":null,"abstract":"1 PROBLEM AND MOTIVATION The run-time performance of modern applications deployed within containers in the cloud critically depends on the amount of provisioned resources. Provisioning fewer resources can result in performance degradation and costly SLA violations; while allocating more resources leads to wasted money and poor resource utilization. Moreover, these applications undergo striking variations in load patterns. To automatically adapt resources in response to changes in load, an autoscaler applies predefined heuristics to add or remove resources allocated for an application based on usage thresholds. However, it is extremely challenging to configure thresholds and scaling parameters in an applicationagnostic manner without deep workload analysis. Poor resource efficiency results in high operating costs and energy expenditures for all cloud-based enterprises. To improve resource efficiency over safe and hand-tuned autoscaling schemes reinforcement learning (RL) based approaches [3, 4] learn an optimal scaling actions through experience (trial-anderror) for every application state, based on the input workload, or other variables. After the learning agent executes an action, it receives a response from the environment, based on the usefulness of the action. The agent is inclined to execute actions that provide higher rewards, thus reinforcing better actions. However, these proposed approaches suffer from the curse of dimensionality [2]. The state and action space to be discretized grows exponentially with the number of state variables, leading to scalability problems, manifested in unacceptable execution times in updation and selection of the next action to be executed in online setting. Moreover, these methods require huge amount of of samples to learn and thus often lack stability and interpretability of policy.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82361389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Lessons from Large-Scale Software as a Service at Databricks Databricks上大规模软件即服务的经验教训

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3365870

M. Zaharia

{"title":"Lessons from Large-Scale Software as a Service at Databricks","authors":"M. Zaharia","doi":"10.1145/3357223.3365870","DOIUrl":"https://doi.org/10.1145/3357223.3365870","url":null,"abstract":"The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software, which has not been heavily studied in research. I will explain some of these challenges based on my experience at Databricks, a startup that provides a data analytics platform as a service on AWS and Azure. Databricks manages millions of VMs per day to run data engineering and machine learning workloads using Apache Spark, TensorFlow, Python and other software for thousands of customers. Two main challenges arise in this context: (1) building a reliable, scalable control plane that can manage thousands of customers at once and (2) adapting the data processing software itself (e.g. Apache Spark) for an elastic cloud environment (for instance, autoscaling instead of assuming static clusters). These challenges are especially significant for data analytics workloads whose users constantly push boundaries in terms of scale (e.g. number of VMs used, data size, metadata size, number of concurrent users, etc). I'll describe some of the common challenges that our new services face and some of the main ways that Databricks has extended and modified open source analytics software for the cloud environment (e.g., designing an autoscaling engine for Apache Spark and creating a transactional storage layer on top of S3 in the Delta Lake open source product).","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87283198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Securing Data in Compromised Clouds 保护受损云中的数据

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2019-11-20 DOI: 10.1145/3357223.3365869

R. A. Popa

引用次数: 0