{"title":"Building Reliable Cloud Services Using Coyote Actors","authors":"Pantazis Deligiannis, Narayanan Ganapathy, A. Lal, S. Qadeer","doi":"10.1145/3472883.3486983","DOIUrl":"https://doi.org/10.1145/3472883.3486983","url":null,"abstract":"Cloud services must typically be distributed across a large number of machines in order to make use of multiple compute and storage resources. This opens the programmer to several sources of complexity such as concurrency, order of message delivery, lossy network, timeouts and failures, all of which impose a high cognitive burden. This paper presents evidence that technology inspired by formal-methods, delivered as part of a programming framework, can help address these challenges. In particular, we describe the experience of several engineering teams in Microsoft Azure that used the open-source Coyote Actor programming framework to build multiple reliable cloud services. Coyote Actors impose a principled design pattern that allows writing formal specifications alongside production code that can be systematically tested, without deviating from routine engineering practices. Engineering teams that have been using Coyote have reported dramatically increased productivity (in time taken to push new features to production) as well as services that have been running live for months without any issues in features developed and tested with Coyote.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74136078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis","authors":"Charles Reiss, Alexey Tumanov","doi":"10.1145/3472883.3517123","DOIUrl":"https://doi.org/10.1145/3472883.3517123","url":null,"abstract":"Test of Time Award Talk for Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis, SoCC 2012.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78920291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Accurate Optimizer for Query Processing over Knowledge Graphs","authors":"Jingqi Wu, Rong Chen, Yubin Xia","doi":"10.1145/3472883.3486991","DOIUrl":"https://doi.org/10.1145/3472883.3486991","url":null,"abstract":"This paper presents Gpl, a fast and accurate optimizer for query processing over knowledge graphs. Gpl is novel in three ways. First, Gpl proposes a type-centric approach to enhance the accuracy of cardinality estimation prominently, which naturally embeds the correlation of multiple query conditions into the existing type system of knowledge graphs. Second, to predict execution time accurately, Gpl constructs a specialized cost model for graph exploration scheme and tunes the coefficients with target hardware platform and graph data. Third, Gpl further uses a budget-aware strategy for plan enumeration with a greedy heuristic to boost the overall performance (i.e., optimization time and execution time) for various workloads. Evaluations with representative knowledge graphs and query benchmarks show that Gpl can select optimal plans for 33 of 39 queries and only incurs less than 5% slowdown on average compared to optimal results. In contrast, the state-of-the-art optimizer and manually tuned results will cause 100% and 36% slowdown, respectively.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76377378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parslo","authors":"Amirhossein Mirhosseini, S. Elnikety, T. Wenisch","doi":"10.1145/3472883.3486985","DOIUrl":"https://doi.org/10.1145/3472883.3486985","url":null,"abstract":"Modern cloud services are implemented as graphs of loosely-coupled microservices to improve programmability, reliability, and scalability. Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction. In such environments, each microservice is independently deployed and (auto-)scaled. However, it is unclear how to optimally scale individual microservices when end-to-end SLOs are violated or underutilized, and how to size each microservice to meet the end-to-end SLO at minimal total cost. In this paper, we propose Parslo---a Gradient Descent-based approach to assign partial SLOs among nodes in a microservice graph under an end-to-end latency SLO. At a high level, the Parslo algorithm breaks the end-to-end SLO budget into small incremental \"SLO units\", and iteratively allocates one marginal SLO unit to the best candidate microservice to achieve the highest total cost savings until the entire end-to-end SLO budget is exhausted. Parslo achieves a near-optimal solution, seeking to minimize the total cost for the entire service deployment, and is applicable to general microservice graphs that comprise patterns like dynamic branching, parallel fan-out, and microservice dependencies. Parslo reduces service deployment costs by more than 6x in real microservice-based applications, compared to a state-of-the-art partial SLO assignment scheme.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86605837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chronus","authors":"Wei Gao, Zhisheng Ye, P. Sun, Yonggang Wen, Tianwei Zhang","doi":"10.1145/3472883.3486978","DOIUrl":"https://doi.org/10.1145/3472883.3486978","url":null,"abstract":"Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner. Job scheduling is the key to improve the training performance, resource utilization and fairness across users. Different training jobs may require various objectives and demands in terms of completion time. How to efficiently satisfy all these requirements is not extensively studied. We present Chronus, an end-to-end scheduling system to provide deadline guarantee for SLO jobs and maximize the performance of best-effort jobs. Chronus is designed based on the unique features of DLT jobs. (1) It leverages the intra-job predictability of DLT processes to efficiently profile jobs and estimate their runtime speed with dynamic resource scaling. (2) It takes advantage of the DLT preemption feature to select jobs with a lease-based training scheme. (3) It considers the placement sensitivity of DLT jobs to allocate resources with new consolidation and local-search strategies. Large-scale simulations on real-world job traces show that Chronus can reduce the deadline miss rate of SLO jobs by up to 14.7x, and the completion time of best-effort jobs by up to 19.9x, compared to existing schedulers. We also implement a prototype of Chronus atop Kubernetes in a cluster of 120 GPUs to validate its practicability.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"88 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73388148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speedo","authors":"N. Daw, U. Bellur, Purushottam Kulkarni","doi":"10.1145/3472883.3486982","DOIUrl":"https://doi.org/10.1145/3472883.3486982","url":null,"abstract":"Structuring cloud applications as collections of interacting fine-grained microservices makes them scalable and affords the flexibility of hot upgrading parts of the application. The current avatar of serverless computing (FaaS) with its dynamic resource allocation and auto-scaling capabilities makes it the deployment model of choice for such applications. FaaS platforms operate with user space dispatchers that receive requests over the network and make a dispatch decision to one of multiple workers (usually a container) distributed in the data center. With the granularity of microservices approaching execution times of a few milliseconds combined with loads approaching tens of thousands of requests a second, having a low dispatch latency of less than one millisecond becomes essential to keep up with line rates. When these microservices are part of a workflow making up an application, the orchestrator that coordinates the sequence in which microservices execute also needs to operate with microsecond latency. Our observations reveal that the most significant component of the dispatch/orchestration latency is the time it takes for the request to traverse into and out of the user space from the network. Motivated by the presence of a multitude of low power cores on today's SmartNICs, one approach to keeping up with these high line rates and the stringent latency expectations is to run both the dispatcher and the orchestrator close to the network on a SmartNIC. Doing so will save valuable cycles spent in transferring requests to and back from the user space. The operating characteristics of short-lived ephemeral state and low CPU burst requirements of FaaS dispatcher/orchestrator make them ideal candidates for offloading from the server to the NIC cores. This also brings the added benefit of freeing up the server CPU. In this paper, we present Speedo---a design for offloading FaaS dispatch and orchestration services to the SmartNIC from the user space. We implemented Speedo on ASIC based Netronome Agilio SmartNICs and our comprehensive evaluation shows that Speedo brings down the dispatch latency from ~150ms to ~140μs at a load of 10K requests per second.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"235 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79703061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automating instrumentation choices for performance problems in distributed applications with VAIF","authors":"Mert Toslali, E. Ates, Alex Ellis, Zhaoqing Zhang, Darby Huye, Lan Liu, Samantha Puterman, A. Coskun, Raja R. Sambasivan","doi":"10.1145/3472883.3487000","DOIUrl":"https://doi.org/10.1145/3472883.3487000","url":null,"abstract":"Developers use logs to diagnose performance problems in distributed applications. However, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We present the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests' traces. We evaluate VAIF by using it to localize performance problems in OpenStack and HDFS. We show that VAIF can localize problems related to slow code paths, resource contention, and problematic third-party code while enabling only 3-34% of the total tracing instrumentation.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87921214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Future of Cloud Data: Challenges and Research Opportunities","authors":"Peter D. Bailis","doi":"10.1145/3472883.3517040","DOIUrl":"https://doi.org/10.1145/3472883.3517040","url":null,"abstract":"The last several years have seen the creation of hundreds of billions of dollars in market value--including the largest software IPO of all time--centered around one technology category: cloud data. While cloud data is not new, the rate of adoption across almost every industry and the associated pace of development around all aspects of cloud data (from pipelines to extract-load-transform (ELT) tools to storage and analytics) are unprecedented. In this talk, I'll present a research-oriented perspective on the future of cloud data that combines my experiences as an academic at Stanford and as a startup founder and CEO at Sisu Data. My goal is to provide an overview of the seismic changes in the cloud data landscape that--in my opinion--have yet to receive sufficient attention from research, and to highlight several tantalizing research opportunities in systems and databases that result.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85942895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scrooge","authors":"Yitao Hu, Rajrup Ghosh, R. Govindan","doi":"10.1145/3472883.3486993","DOIUrl":"https://doi.org/10.1145/3472883.3486993","url":null,"abstract":"Advances in deep learning (DL) have prompted the development of cloud-hosted DL-based media applications that process video and audio streams in real-time. Such applications must satisfy throughput and latency objectives and adapt to novel types of dynamics, while incurring minimal cost. Scrooge, a system that provides media applications as a service, achieves these objectives by packing computations efficiently into GPU-equipped cloud VMs, using an optimization formulation to find the lowest cost VM allocations that meet the performance objectives, and rapidly reacting to variations in input complexity (e.g., changes in participants in a video). Experiments show that Scrooge can save serving cost by 16-32% (which translates to tens of thousands of dollars per year) relative to the state-of-the-art while achieving latency objectives for over 98% under dynamic workloads.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78896456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Data to Improve Cloud Services","authors":"Ranjita Bhagwan","doi":"10.1145/3472883.3517038","DOIUrl":"https://doi.org/10.1145/3472883.3517038","url":null,"abstract":"Today's cloud services are large, complex, and dynamic, often supporting billions of users. Such a complex and dynamic environment poses several challenges such as ensuring fast and secure development and deployment, and prompt resolution of service disruptions. Nevertheless, new opportunities to address such challenges have emerged. Large-scale services generate petabytes of code, test, and usage-related data within just one day. This data can be harnessed to provide valuable insights to engineers on how to improve service performance, security and reliability. However, cherry-picking important information from such vast amounts of systems-related data proves to be a formidable task. Over the last few years, we have developed many analysis tools that leverage code, test logs and telemetry to address these challenges. In this talk, I will talk about our experience with building such tools, and describe our journey which started with determining the right problems to solve, making research contributions and ended with widespread deployment across Microsoft's services.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88245493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}