Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference): Latest articles

OptDebug

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3487016

Muhammad Ali Gulzar, Miryung Kim

Abstract: Fault isolation is extremely challenging in large-scale data processing in cloud environments. Data provenance is a dominant existing approach to isolating the data records responsible for a given output. However, data provenance addresses fault isolation only in the data space, not in the code space: how can we precisely localize the operations or APIs responsible for a given suspicious or incorrect result? We present OptDebug, which identifies fault-inducing operations in a dataflow application using three insights. First, debugging is easier with a small-scale input than with a large-scale input, so OptDebug uses data provenance to simplify the original input records to a smaller set leading to test failures and test successes. Second, keeping track of operation provenance is crucial for debugging, so it leverages automated taint analysis to propagate the lineage of operations downstream with individual records. Lastly, each operation may contribute to test failures to a different degree, so OptDebug ranks each operation's spectra: the relative participation frequency in failing vs. passing tests. In our experiments, OptDebug achieves 100% recall and 86% precision in detecting faulty operations and reduces debugging time by 17x compared to a naïve approach. Overall, OptDebug shows great promise in improving developer productivity in today's complex data processing pipelines by obviating the need to re-execute the program repeatedly with different inputs and manually examine program traces to isolate buggy code.

Citations: 1
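
The ranking step described in the OptDebug abstract (relative participation of each operation in failing vs. passing tests) can be illustrated with a generic spectrum-based score. This is a minimal sketch, not OptDebug's implementation: the abstract does not say which formula is used, so the Ochiai score is swapped in as an assumption, and the function name `ochiai_ranking`, the operation names, and the sample runs are all hypothetical.

```python
from math import sqrt


def ochiai_ranking(runs):
    """Rank operations by suspiciousness.

    runs: list of (operations, passed) pairs, where `operations` is the set of
    operation names whose lineage reached the test's records, and `passed` is
    True for a passing test. The Ochiai formula is an assumption here.
    """
    total_failed = sum(1 for _, passed in runs if not passed)
    all_ops = set().union(*(ops for ops, _ in runs))
    scores = {}
    for op in all_ops:
        failed_with = sum(1 for ops, passed in runs if op in ops and not passed)
        passed_with = sum(1 for ops, passed in runs if op in ops and passed)
        denom = sqrt(total_failed * (failed_with + passed_with))
        scores[op] = failed_with / denom if denom else 0.0
    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical test outcomes on a reduced input set.
runs = [
    ({"map", "filter"}, True),
    ({"map", "divide"}, False),
    ({"map", "filter", "divide"}, False),
    ({"map"}, True),
]
for op, score in ochiai_ranking(runs):
    print(f"{op}: {score:.3f}")
```

In this toy data `divide` appears only in failing runs and so ranks first, while `map`, which appears in every run, ranks below it.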

Kraken

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3486992

Vivek M. Bhasi, J. Gunasekaran, P. Thinakaran, Cyan Subhra Mishra, M. Kandemir, C. Das

Abstract: The growing popularity of microservices has led to the proliferation of online cloud service-based applications, which are typically modelled as Directed Acyclic Graphs (DAGs) comprising tens to hundreds of microservices. The vast majority of these applications are user-facing and hence have stringent SLO requirements. Serverless functions, having short resource provisioning times and instant scalability, are suitable candidates for developing such latency-critical applications. However, existing serverless providers are unaware of the workflow characteristics of application DAGs, leading to container over-provisioning in many cases. This is further exacerbated in the case of dynamic DAGs, where the function chain for an application is not known a priori. Motivated by these observations, we propose Kraken, a workflow-aware resource management framework that minimizes the number of containers provisioned for an application DAG while ensuring SLO compliance. We design and implement Kraken on OpenFaaS and evaluate it on a multi-node Kubernetes-managed cluster. Our extensive experimental evaluation using the DeathStarBench workload suite and real-world traces demonstrates that Kraken spawns up to 76% fewer containers, thereby improving container utilization and saving cluster-wide energy by up to 4x and 48%, respectively, when compared to state-of-the-art schedulers employed in serverless platforms.

Citations: 36

Portkey

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3487004

Joseph Noor, M. Srivastava, R. Netravali

Abstract: Owing to a need for low-latency data accesses, emerging IoT and mobile applications commonly require distributed data stores (e.g., key-value or KV stores) to operate entirely at the network's edge. Unfortunately, existing KV stores employ randomized data placement policies (e.g., consistent hashing) that ignore the client mobility and resulting variance in client-server latencies that are inherent to edge applications; the effect is largely suboptimal and inefficient data placement. We present Portkey, a distributed KV store that dynamically adapts data placement according to time-varying client mobility and data access patterns. The key insight with Portkey is to lean into the inherent mobility and prioritize rapid but approximate placement decisions over delayed optimal ones. Doing so enables the efficient tracking of client-server latencies despite edge resource constraints, and the use of greedy placement heuristics that are self-correcting over short timescales. Results with a realistic autonomous vehicle dataset and two small-scale deployments reveal that Portkey reduces average and tail request latency by 21-82% and 26-77%, respectively, compared to existing placement strategies.

Citations: 1
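
The idea of greedy, self-correcting placement from the Portkey abstract can be sketched as follows. This is an illustration under stated assumptions rather than Portkey's algorithm: the cost function (access count times current client-server latency), the function name `greedy_placement`, and all keys, clients, servers, and latencies are hypothetical.

```python
def greedy_placement(access_counts, latency_ms):
    """Place each key on the server with the lowest access-weighted latency.

    access_counts: {key: {client: number_of_recent_accesses}}
    latency_ms:    {(client, server): latest_latency_estimate_in_ms}
    Each key is decided independently (greedy), using only current estimates.
    """
    servers = sorted({server for _, server in latency_ms})
    placement = {}
    for key, per_client in access_counts.items():
        def cost(server):
            return sum(n * latency_ms[(client, server)]
                       for client, n in per_client.items())
        placement[key] = min(servers, key=cost)
    return placement


# Hypothetical snapshot: client c1 is near server s1, client c2 near s2.
accesses = {"k1": {"c1": 10, "c2": 1}, "k2": {"c1": 1, "c2": 10}}
latency = {("c1", "s1"): 5, ("c1", "s2"): 40,
           ("c2", "s1"): 35, ("c2", "s2"): 8}
print(greedy_placement(accesses, latency))

# After c1 moves towards s2, rerunning with fresh estimates relocates k1.
latency[("c1", "s1")], latency[("c1", "s2")] = 40, 5
print(greedy_placement(accesses, latency))
```

Rerunning the same cheap decision whenever latency estimates change is what makes an approximate choice self-correcting in this sketch: a placement that becomes poor after a client moves is replaced on the next pass.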

ServerMore

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3486979

Amoghavarsha Suresh, Anshul Gandhi

Citations: 11

Tell me when you are sleepy and what may wake you up!

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3487013

Djob Mvondo, A. Barbalace, A. Tchana, Gilles Muller

Abstract: There is a shift in the deployment model of Cloud and Edge applications. Applications are now deployed as a set of several small units communicating with each other: the microservice model. Moreover, each unit (a microservice) may be implemented as a virtual machine, container, function, etc., spanning the different Cloud and Edge service models, including IaaS, PaaS, and FaaS. A microservice is instantiated upon the reception of a request (e.g., an HTTP packet or a trigger), and a rack-level or data-center-level scheduler decides the placement of such a unit of execution, considering, for example, data locality and load balancing. With such a configuration, it is common to encounter scenarios where different units, as well as multiple instances of the same unit, are running on a single server at the same time. When multiple microservices are running on the same server, not all of them are necessarily doing actual processing; some may be busy-waiting, i.e., waiting for events (or requests) sent by other units. However, these "idle" units consume CPU time that could be used by other running units or by cloud utility functions on the server (e.g., monitoring daemons). In a controlled experiment, we observe that units can spend up to 20% - 55% of their CPU time waiting, so a great amount of CPU time is wasted; these values grow significantly when overcommitting CPU resources (i.e., units' CPU reservations exceed server CPU capacity), where we observe up to 69% - 75%. This results from the server CPU scheduler lacking information and context about what is running in each unit. In this paper, we first provide evidence of the problem and discuss several research questions. Then, we propose a handful of solutions worth exploring, which consist in revisiting hypervisor and host OS scheduler designs to reduce the CPU time wasted on idle units. Our proposal leverages the concepts of informed scheduling and of monitoring for internal and external events. Based on these solutions, we present our initial implementation on Linux/KVM.

Citations: 3

Towards Reliable AI for Source Code Understanding

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3486995

Sahil Suneja, Yunhui Zheng, Yufan Zhuang, Jim Laredo, Alessandro Morari

Citations: 6

George: Learning to Place Long-Lived Containers in Large Clusters with Operation Constraints

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3486971

Suyi Li, Luping Wang, Wen Wang, Yinghao Yu, Bo Li

Abstract: Online cloud services are widely deployed as Long-Running Applications (LRAs) hosted in containers. Placing LRA containers turns out to be particularly challenging due to the complex interference between co-located containers and the operation constraints in production clusters, such as fault tolerance, disaster avoidance, and incremental deployment. Existing schedulers typically provide APIs for operators to manually specify the container scheduling requirements and offer only qualitative scheduling guidelines for container placement. Such schedulers do not perform well in terms of either performance or scale, and they also require manual intervention. In this work, we propose George, an end-to-end general-purpose LRA scheduler that leverages state-of-the-art Reinforcement Learning (RL) techniques to intelligently schedule LRA containers. We present an optimal container placement formulation for the first time, with the objective of maximizing container placement performance subject to a set of operation constraints. One fundamental challenge in scheduling is to categorically satisfy different operation constraints in practice; specifically, to guarantee hard constraints and to keep soft-constraint violations within a pre-defined threshold. We design a novel projection-based proximal policy optimization (PPPO) algorithm in combination with an Integer Linear optimization technique to intelligently schedule LRA containers under operation constraints. To reduce the training time, we apply a transfer learning technique that takes advantage of the similarity between different LRA scheduling events. We prove theoretically that our proposed algorithm is effective, stable, and safe. We implement George as a plug-in service in Docker Swarm. Our in-house cluster demonstrates that George can maximize LRA performance while enforcing the hard constraints and keeping the soft constraints within a pre-defined threshold. The experiments show that George improves LRA performance and scale tremendously, requiring less than 1 hour of scheduling time in a large cluster with 2K containers and 700 machines, 16x faster than existing schedulers. Compared with state-of-the-art alternatives, George also achieves 26% higher container performance with up to 70% lower constraint violation.

Citations: 13

Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3487003

Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, Chengzhong Xu

Abstract: Loosely-coupled and light-weight microservices running in containers are gradually replacing monolithic applications. Understanding the characteristics of microservices is critical to making good use of microservice architectures. However, there has so far been no comprehensive study of microservices and their related systems in production environments. In this paper, we present a solid analysis of large-scale deployments of microservices at Alibaba clusters. Our study focuses on the characterization of microservice dependency as well as its runtime performance. We conduct an in-depth anatomy of microservice call graphs to quantify the difference between them and traditional DAGs of data-parallel jobs. In particular, we observe that microservice call graphs are heavy-tail distributed, that their topology is similar to a tree, and that many microservices are hot-spots. We reveal three types of meaningful call dependency that can be utilized to optimize microservice designs. Our investigation of microservice runtime performance indicates that most microservices are much more sensitive to CPU interference than to memory interference. To synthesize more representative microservice traces, we build a mathematical model to simulate call graphs. Experimental results demonstrate that our model preserves well the graph properties observed in the Alibaba traces.

Citations: 87

3MileBeach

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3486986

Jun Zhang, Robert Ferydouni, Aldrin Montana, Daniel Bittman, P. Alvaro

Citations: 5

Good Things Come to Those Who Wait: Optimizing Job Waiting in the Cloud

Pub Date: 2021-11-01 DOI: 10.1145/3472883.3487007

Pradeep Ambati, Noman Bashir, David E. Irwin, P. Shenoy

Abstract: Cloud-enabled schedulers execute jobs on either fixed resources or those acquired on demand from cloud platforms. Thus, these schedulers must define not only a scheduling policy, which selects which jobs run when fixed resources become available, but also a waiting policy, which selects which jobs wait for fixed resources when they are not available, rather than run on on-demand resources. As with scheduling policies, optimizing waiting policies requires a priori knowledge of job runtime. Unfortunately, prior work has shown that accurately predicting job runtime is challenging. In this paper, we show that optimizing job waiting in the cloud is possible without accurate job runtime predictions. To do so, we i) speculatively execute jobs on on-demand resources for a small time and cost to learn more about job runtime, and ii) develop an ML model to predict wait time from cluster state, which is more accurate and has less overhead than prior approaches that use job runtime predictions. We evaluate our approach on a year-long batch workload consisting of 14 million jobs, and show that it yields a cost and average wait time within 4% and 13%, respectively, of the optimal.

Citations: 2
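
The two ingredients named in the abstract above (a short speculative run on on-demand resources, then a wait-time prediction) can be combined into a simple decision rule. The sketch below is only one plausible arrangement and is not taken from the paper: the threshold rule, the `Decision` type, the function `waiting_policy`, and the parameter `max_wait_s` are all hypothetical, and the predicted wait time is assumed to come from some external model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    resource: str  # "on-demand" or "fixed"
    reason: str


def waiting_policy(finished_in_probe, predicted_wait_s, max_wait_s):
    """Decide where a job should run after its speculative probe.

    finished_in_probe: True if the job completed during the short on-demand run.
    predicted_wait_s:  wait-time estimate for fixed resources (assumed to be
                       supplied by a model trained on cluster state).
    max_wait_s:        hypothetical tolerance for how long a job may wait.
    """
    if finished_in_probe:
        return Decision("on-demand", "short job completed during the probe")
    if predicted_wait_s <= max_wait_s:
        return Decision("fixed", "predicted wait is within the tolerated bound")
    return Decision("on-demand", "predicted wait exceeds the tolerated bound")


print(waiting_policy(False, predicted_wait_s=120, max_wait_s=600))
print(waiting_policy(False, predicted_wait_s=3600, max_wait_s=600))
```

The probe stands in for a runtime prediction: a job that survives it is known to be at least that long, so no explicit runtime estimate is needed before consulting the wait-time model.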