{"title":"Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis","authors":"Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, Chengzhong Xu","doi":"10.1145/3472883.3487003","DOIUrl":"https://doi.org/10.1145/3472883.3487003","url":null,"abstract":"Loosely-coupled and light-weight microservices running in containers are replacing monolithic applications gradually. Understanding the characteristics of microservices is critical to make good use of microservice architectures. However, there is no comprehensive study about microservice and its related systems in production environments so far. In this paper, we present a solid analysis of large-scale deployments of microservices at Alibaba clusters. Our study focuses on the characterization of microservice dependency as well as its runtime performance. We conduct an in-depth anatomy of microservice call graphs to quantify the difference between them and traditional DAGs of data-parallel jobs. In particular, we observe that microservice call graphs are heavy-tail distributed and their topology is similar to a tree and moreover, many microservices are hot-spots. We reveal three types of meaningful call dependency that can be utilized to optimize microservice designs. Our investigation on microservice runtime performance indicates most microservices are much more sensitive to CPU interference than memory interference. To synthesize more representative microservice traces, we build a mathematical model to simulate call graphs. Experimental results demonstrate our model can well preserve those graph properties observed from Alibaba traces.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79663199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Clamor
Authors: Pratiksha Thaker, Hudson Ayers, Deepti Raghavan, Ning Niu, P. Levis, M. Zaharia
DOI: https://doi.org/10.1145/3472883.3486996
Published: 2021-11-01, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: We propose Clamor, a functional cluster computing framework that adds support for fine-grained, transparent access to global variables for distributed, data-parallel tasks. Clamor targets workloads that perform sparse accesses and updates within the bulk synchronous parallel execution model, a setting where the standard technique of broadcasting global variables is highly inefficient. Clamor implements a novel dynamic replication mechanism to enable efficient access to popular data regions on the fly, and tracks fine-grained dependencies to retain the lineage-based fault tolerance model of systems like Spark. Clamor can integrate with existing Rust and C++ libraries to transparently distribute programs on the cluster. We show that Clamor is competitive with Spark on simple functional workloads and can significantly improve performance over custom systems on workloads that sparsely access large global variables: from 5x for sparse logistic regression to over 100x on distributed geospatial queries.
{"title":"Morphling","authors":"Luping Wang, Lingyun Yang, Yinghao Yu, Wei Wang, Bolong Li, Xianchao Sun, Jian He, Liping Zhang","doi":"10.1145/3472883.3486987","DOIUrl":"https://doi.org/10.1145/3472883.3486987","url":null,"abstract":"Machine learning models are widely deployed in production cloud to provide online inference services. Efficiently deploying inference services requires careful tuning of hardware and runtime configurations (e.g., GPU type, GPU memory, batch size), which can significantly improve the model serving performance and reduce cost. However, existing autoconfiguration approaches for general workloads, such as Bayesian optimization and white-box prediction, are inefficient in navigating the high-dimensional configuration space of model serving, incurring high sampling cost. In this paper, we present Morphling, a fast, near-optimal auto-configuration framework for cloud-native model serving. Morphling employs model-agnostic meta-learning to navigate the large configuration space. It trains a metamodel offline to capture the general performance trend under varying configurations. Morphling quickly adapts the metamodel to a new inference service by sampling a small number of configurations and uses it to find the optimal one. We have implemented Morphling as an auto-configuration service in Kubernetes, and evaluate its performance with popular CV and NLP models, as well as the production inference services in Alibaba. Compared with existing approaches, Morphling reduces the median search cost by 3x-22x, quickly converging to the optimal configuration by sampling only 30 candidates in a large search space consisting of 720 options.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74655796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Lorien: Efficient Deep Learning Workloads Delivery
Authors: Cody Hao Yu, Xingjian Shi, Haichen Shen, Zhi Chen, Mu Li, Yida Wang
DOI: https://doi.org/10.1145/3472883.3486973
Published: 2021-11-01, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: Modern deep learning systems embrace compilation to automatically generate code for deep learning models, keeping pace with rapidly evolving deep learning operators and newly emerging hardware platforms. The performance of the generated code is guaranteed via auto-tuning frameworks, which normally take a long time to find proper execution schedules for the given operators; this hurts both user experience and time-to-market for model development and deployment. To efficiently deliver a high-performance schedule upon request, we present Lorien, an open-source infrastructure that tunes operators and orchestrates the tuned schedules in a systematic way. Lorien is designed to be extensible to state-of-the-art auto-tuning frameworks, and scalable enough to coordinate a number of compute resources for its tuning tasks with fault tolerance. We leveraged Lorien to extract thousands of operator-level tuning tasks from 29 widely used models in the Gluon CV model zoo [22], and tuned them on x86 CPU, ARM CPU, and NVIDIA GPU to construct a database for queries. In addition, to deliver reasonably high-performance schedules for unseen workloads in seconds or minutes, Lorien integrates an AutoML solution that trains a performance cost model on the collected large-scale datasets. Our evaluation shows that the AutoML-based solution is accurate enough to enable zero-shot tuning, which neither fine-tunes the cost model during tuning nor performs on-device measurements, and finds decent schedules in at least 10x less time than existing auto-tuning frameworks.
Title: Iter8
Authors: Mert Toslali, S. Parthasarathy, Fábio Oliveira, Hai Huang, A. Coskun
DOI: https://doi.org/10.1145/3472883.3486984
Published: 2021-11-01, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: Online experimentation is an agile software development practice that plays an essential role in enabling rapid innovation. Existing solutions for online experimentation, designed for Web and mobile applications, are unsuitable for cloud applications; online experimentation needs to be rethought for the cloud to address the unique challenges posed by cloud environments. In this paper, we introduce Iter8, an open-source system that enables practitioners to deliver code changes to cloud applications in an agile manner while minimizing risk. Unlike existing solutions, Iter8 embodies a novel mathematical formulation built on online Bayesian learning and multi-armed bandit algorithms, enabling online experimentation tailored to the cloud that considers both SLOs and business concerns. Using Iter8, practitioners can safely and rapidly orchestrate various types of online experiments, gain key insights into the behavior of cloud applications, and roll out the optimal versions in an automated and statistically rigorous manner.
Title: Siren: Byzantine-robust Federated Learning via Proactive Alarming
Authors: Hanxi Guo, Hao Wang, Tao Song, Yang Hua, Zhangcheng Lv, Xiulang Jin, Zhengui Xue, Ruhui Ma, Haibing Guan
DOI: https://doi.org/10.1145/3472883.3486990
Published: 2021-11-01, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: With the popularity of machine learning across many applications, data privacy has become a severe issue when machine learning is applied in the real world. Federated learning (FL), an emerging paradigm in machine learning, aims to train a centralized model while leaving training data distributed across a large number of clients, thereby avoiding data privacy leakage; it has attracted great attention recently. However, the distributed training scheme in FL is susceptible to various kinds of attacks. Existing defense systems mainly rely on model weight analysis to identify malicious clients, which has many limitations. For example, some defense systems must know the exact number of malicious clients beforehand, which can be easily bypassed by well-designed attack methods, making them impractical for real-world scenarios. This paper presents Siren, a Byzantine-robust federated learning system built on a proactive alarming mechanism. Compared with current Byzantine-robust aggregation rules, Siren can defend against attacks from a higher proportion of malicious clients while keeping the global model performing normally. Extensive experiments against different attack methods are conducted under diverse settings, on both independent and identically distributed (IID) and non-IID data. The experimental results illustrate the effectiveness of Siren compared with several state-of-the-art defense methods.
Title: Sayer: Using Implicit Feedback to Optimize System Policies
Authors: Mathias Lécuyer, Sang Hoon Kim, Mihir Nanavati, Junchen Jiang, S. Sen, Aleksandrs Slivkins, Amit Sharma
DOI: https://doi.org/10.1145/3472883.3487001
Published: 2021-10-28, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: We observe that many system policies that make threshold decisions involving a resource (e.g., time, memory, cores) naturally reveal additional, or implicit, feedback. For example, if a system waits X min for an event to occur, then it automatically learns what would have happened if it had waited < X min, because time has a cumulative property. This feedback tells us about alternative decisions, and can be used to improve the system policy. However, leveraging implicit feedback is difficult because it tends to be one-sided or incomplete, and may depend on the outcome of the event. As a result, existing practices for using feedback, such as simply incorporating it into a data-driven model, suffer from bias. We develop a methodology, called Sayer, that leverages implicit feedback to evaluate and train new system policies. Sayer builds on two ideas from reinforcement learning, randomized exploration and unbiased counterfactual estimators, to leverage data collected by an existing policy to estimate the performance of new candidate policies without actually deploying them. Sayer uses implicit exploration and implicit data augmentation to generate implicit feedback in an unbiased form, which is then used by an implicit counterfactual estimator to evaluate and train new policies. The key idea underlying these techniques is to assign implicit probabilities to decisions that are not actually taken but whose feedback can be inferred; these probabilities are carefully calculated to ensure statistical unbiasedness. We apply Sayer to two production scenarios in Azure, and show that it can evaluate arbitrary policies accurately and train new policies that outperform the production policies.
Title: Version Reconciliation for Collaborative Databases
Authors: Nalin Ranjan, Zechao Shang, Aaron J. Elmore, S. Krishnan
DOI: https://doi.org/10.1145/3472883.3486980
Published: 2021-10-05, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: We propose MindPalace, a prototype of a versioned database for efficient collaborative data management. MindPalace supports offline collaboration, where users work independently without real-time correspondence. The core of MindPalace is a critical step of offline collaboration: reconciling divergent branches made by simultaneous data manipulation. We formalize the concept of auto-mergeability, a condition under which branches may be reconciled without human intervention, and propose an efficient framework for determining whether two branches are auto-mergeable and identifying particular records for manual reconciliation.
Title: Enabling Sustainable Clouds: The Case for Virtualizing the Energy System
Authors: Noman Bashir, Tianyou Guo, M. Hajiesmaili, David E. Irwin, P. Shenoy, R. Sitaraman, Abel Souza, A. Wierman
DOI: https://doi.org/10.1145/3472883.3487009
Published: 2021-06-16, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: Cloud platforms' growing energy demand and carbon emissions are raising concern about their environmental sustainability. The current approach to enabling sustainable clouds focuses on improving energy efficiency and purchasing carbon offsets. These approaches have limits: many cloud data centers already operate near peak efficiency, and carbon offsets cannot scale to a near-zero-carbon grid, where there is little carbon left to offset. Instead, enabling sustainable clouds will require applications to adapt to when and where unreliable low-carbon energy is available. Applications cannot do this today because their energy use and carbon emissions are not visible to them, as the energy system provides the rigid abstraction of a continuous, reliable energy supply. This vision paper instead advocates a "carbon first" approach to cloud design that elevates carbon-efficiency to a first-class metric. To do so, we argue that cloud platforms should virtualize the energy system by exposing visibility into, and software-defined control over, it to applications, enabling them to define their own abstractions for managing energy and carbon emissions based on their own requirements.
Title: Citadel: Protecting Data Privacy and Model Confidentiality for Collaborative Learning
Authors: Chengliang Zhang, Junzhe Xia, Baichen Yang, Huancheng Puyang, W. Wang, Ruichuan Chen, I. E. Akkus, Paarijaat Aditya, Feng Yan
DOI: https://doi.org/10.1145/3472883.3486998
Published: 2021-05-04, Proceedings of the ACM Symposium on Cloud Computing (SoCC '21)
Abstract: Many organizations own data but have limited machine learning expertise (data owners). Other organizations have the expertise but need data from diverse sources to train truly generalizable models (model owners). As machine learning (ML) advances and awareness of it grows, data owners would like to pool their data and collaborate with model owners, so that both parties can benefit from the resulting models. In such a collaboration, data owners want to protect the privacy of their training data, while model owners want confidentiality for the model and the training method, which may contain intellectual property. Existing private ML solutions, such as federated learning and split learning, cannot simultaneously meet the privacy requirements of both data and model owners. We present Citadel, a scalable collaborative ML system that protects both data and model privacy in untrusted infrastructures equipped with Intel SGX. Citadel performs distributed training across multiple training enclaves running on behalf of data owners and an aggregator enclave on behalf of the model owner. Citadel establishes a strong information barrier between these enclaves via zero-sum masking and hierarchical aggregation to prevent data/model leakage during collaborative training. Compared with existing SGX-protected systems, Citadel achieves better scalability and stronger privacy guarantees for collaborative ML. Cloud deployment with various ML models shows that Citadel scales to a large number of enclaves with less than 1.73x slowdown.