{"title":"Host Hypervisor Trace Mining for Virtual Machine Workload Characterization","authors":"Hani Nemati, S. V. Azhari, M. Dagenais","doi":"10.1109/IC2E.2019.00024","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00024","url":null,"abstract":"The efficient operation and resource management of multi-tenant data centers hosting thousands of services is a demanding task, that requires precise and detailed information regarding the behaviour of each and every virtual machine (VM). Often, coarse measures such as CPU, memory, disk and network usage by VMs are considered in grouping them onto the same physical server, as detailed measures would require access to the guest operating system (OS), which is not feasible in a multi-tenant setting. In this paper, we propose host-level hypervisor tracing as a non-intrusive means to extract useful features, that can provide for fine grain characterization of VM behaviour. In particular, we extract VM blocking periods as well as virtual interrupt injection rates to detect multiple levels of resource intensiveness. In addition, we consider the resource contention rate due to other VMs and the host, along with reasons for exit from non-root to root privileged mode, revealing useful information about the nature of the underlying VM workload. We also use tracing to get information about the rate of process and thread preemption in each VM, extracting process and thread contention as another feature set. We then employ various feature selection strategies and assess the quality of the resulting workload clustering. Notably, we adopt a two-stage feature selection approach in addition to a one shot clustering scheme. Moreover, we consider inter-cluster and intra-cluster similarity metrics, such as the silhouette score, to discover distinct groups of workloads as well as workload groups with significant overlap. This information can be used by 1) data center administrators to gain deeper visibility into the nature of various VMs running on their infrastructure, 2) performance engineers to assist root cause analysis of VM issues and 3) IaaS providers to help in resource management based on VM behavior.","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130705032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Importance of Application-Level Resource Management in Multi-Cloud Deployments","authors":"Z. Dimitrijevic, Cetin Sahin, Christian Tinnefeld, J. Patvarczki","doi":"10.1109/IC2E.2019.00028","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00028","url":null,"abstract":"Cloud service providers started with Infrastructure as a Service (IaaS) offerings and over time expanded into Platform as a Service (PaaS) and Software as a Service (SaaS). Even though each provider has a rich product offering, there are many scenarios where a multi-cloud strategy is desirable: utilizing economic dynamics, preventing data lock-in with one vendor, circumventing geographic restrictions, complying with local regulations, or combining on-premise and public-cloud resources. The challenge from a consumer perspective with multi-cloud deployments is the lack of a common abstraction for the offered products and a standardized way to express all of the application requirements for the resulting deployments. In this paper, we contribute by making yet another case for multi-cloud deployments and by predicting the emergence of a new generation of application-level resource managers which will natively support multi-cloud for enterprise applications. We identify three main components of the feedback loop controlled application-level resource managers: the software life-cycle manager, the data storage and access manager, and the service execution manager.","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122158892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ShadeNF: Testing Online Network Functions","authors":"Hui Lu, Abhinav Srivastava, Yu Sun","doi":"10.1109/IC2E.2019.00027","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00027","url":null,"abstract":"The correct implementation of network policies for \"in-production\" network functions is critical, as it determines the security, availability and performance of a production network. Usually, conducting network testing for these network functions in a live production environment is attractive, as the production environment captures the most exact, realistic dynamic state and vulnerabilities of the system under test. However, doing so also brings potential risks of impacting or even damaging the production system. To address this tension, we present ShadeNF, a novel online platform for testing in-cloud network functions in a production-like environment, without disrupting the real production system. ShadeNF enables such a production-like environment with an exact live clone of production network functions and real production traffic as the test traffic. In designing and implementing ShadeNF, we address several key challenges and contribute new techniques in supporting such a testing platform, including an SDN-based live, consistent snapshot approach, a new programmable forwarding plane, and a scaled test traffic generator. We implement a ShadeNF prototype upon OpenStack and demonstrate that ShadeNF successfully captures the dynamics of production systems, and effectively localizes a range of policy violations in SDN/NFV systems.","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124576928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information Models: Creating and Preserving Value in Volatile Cloud Resources","authors":"Chaojie Zhang, Varun Gupta, A. Chien","doi":"10.1109/IC2E.2019.00018","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00018","url":null,"abstract":"Volatile resources are surplus cloud resources not consumed by high priority foreground (reserved/on-demand) load. These resources are exploited by a growing number of users. Today, cloud operators provide no statistical characterization of volatile resources. We consider how releasing such statistics could improve user value by studying Amazon's 608 EC2 Spot Instance types. Results show that as little as two parameters such as (average, 90pctile) can increase user value by 30%. These results are robust over four-fifths (475 of 608) of instance types. Beyond competitive concerns, cloud operators are reluctant to share volatile resource statistics because they might be considered a service-level agreement (SLA), and thus constrain their ability to serve foreground load. We show that clever resource management can allay such concerns. We study two plausible classes of foreground load changes, showing one class where such a concern is indeed valid and another where it is not. We design two online resource management algorithms that detect foreground load variation and adapt to maintain a statistical SLA. The algorithms not only improve the ability to maintain guarantees and user value but also improve user experience, reducing job failures by 50%. These results apply to the Stable and Transition classes of instance types, which account for nearly all of the instance types (577 of 608).","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131754280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SQUEET Program Committee","authors":"","doi":"10.1109/ic2e.2019.00-16","DOIUrl":"https://doi.org/10.1109/ic2e.2019.00-16","url":null,"abstract":"","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128825224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous Benchmarking: Using System Benchmarking in Build Pipelines","authors":"M. Grambow, Fabian Lehmann, David Bermbach","doi":"10.1109/IC2E.2019.00039","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00039","url":null,"abstract":"Continuous integration and deployment are established paradigms in modern software engineering. Both intend to ensure the quality of software products and to automate the testing and release process. Today's state of the art, however, focuses on functional tests or small microbenchmarks such as single method performance while the overall quality of service (QoS) is ignored. In this paper, we propose to add a dedicated benchmarking step into the testing and release process which can be used to ensure that QoS goals are met and that new system releases are at least as \"good\" as the previous ones. For this purpose, we present a research prototype which automatically deploys the system release, runs one or more benchmarks, collects and analyzes results, and decides whether the release fulfills predefined QoS goals. We evaluate our approach by replaying two years of Apache Cassandra's commit history.","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129952010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward a Workload Allocation Optimizer for Power Saving in Data Centers","authors":"Ying-Feng Hsu, H. Kuwahara, Kazuhiro Matsuda, Morito Matsuoka","doi":"10.1109/IC2E.2019.00019","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00019","url":null,"abstract":"The number and scale of data centers are both rapidly increasing due to a continuously growing demand for cloud computing services from many areas. Cloud computing infrastructure relies on a massive amount of HPC servers to process millions of tasks and consumes an enormous amount of power. The implementation of advanced task allocation technology provides a solution for energy efficiency and has therefore become an essential goal for data centers. In this paper, we propose a novel CPU-intensive workload allocation optimizer (WAO) for the task of power saving within data centers. There are three major contributions to this research. First, a data center monitoring module, which continually reports the latest status of the data center and stores operational data. Second, we propose an accurate and efficient server power prediction model for all servers in the HPC clusters. Third, we provide an optimal task assignment engine that evaluates and assigns tasks to the most appropriate server to facilitate minimal power consumption. Our experimental results show that our proposed WAO can obtain about 29.6% power savings and 26% more productivity in a real data center.","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128095908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Future of Computing is Boring (and that is exciting!)","authors":"Aleksander Slominski, Vinod Muthusamy, Vatche Isahagian","doi":"10.1109/IC2E.2019.00023","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00023","url":null,"abstract":"We see a trend where computing becomes a metered utility similar to how the electric grid evolved. Initially electricity was generated locally but economies of scale (and standardization) made it more efficient and economical to have utility companies managing the electric grid. Similar developments can be seen in computing where scientific grids paved the way for commercial cloud computing offerings. However, in our opinion, that evolution is far from finished and in this paper we bring forward the remaining challenges and propose a vision for the future of computing. In particular we focus on diverging trends in the costs of computing and developer time, which suggests that future computing architectures will need to optimize for developer time.","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131299297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Synchronization Costs for Distributed ML on Transient Cloud Resources","authors":"Pradeep Ambati, David E. Irwin, P. Shenoy, Lixin Gao, A. Ali-Eldin, Jeannie R. Albrecht","doi":"10.1109/IC2E.2019.00029","DOIUrl":"https://doi.org/10.1109/IC2E.2019.00029","url":null,"abstract":"Cloud platforms often execute parallel batch applications, such as distributed machine learning (ML), that include numerous synchronization barriers. These barriers, which prevent any task from advancing beyond a specified point until all tasks have reached that point, significantly degrade application performance by reducing it to that of the slowest \"straggler\" task. To address the problem, researchers have proposed numerous straggler mitigation techniques, including speculatively re-executing straggler tasks and various relaxations of strict barrier semantics. While these techniques improve parallel application performance, they incur a cost in terms of the resources wasted re-executing tasks or waiting. Importantly, these costs, which are often implicit in prior work that targets dedicated resources, become explicit in the cloud, which charges for resources at fine-grained intervals. In addition, the cost difference between techniques is exacerbated in cloud platforms, since they charge substantially less for transient resources that effectively yield a probabilistic performance across a wide range. While transient resources' low list price is attractive, revocations increase the frequency and severity of stragglers, which decreases parallel job performance and increases overall execution cost. To better understand the cost of synchronization, we develop simple analytical models of different straggler mitigation techniques and compare their cost and performance on on-demand and transient resources. Our analysis shows that i) transient servers offer complex tradeoffs compared to on-demand servers, and can result in higher overall costs despite their highly discounted price due to their probabilistic performance; ii) common approaches to straggler mitigation, which is a well-studied problem, are less effective using transient servers that cause frequent and severe stragglers; and iii) a recent approach to flexible synchronization offers the best cost and performance.","PeriodicalId":226094,"journal":{"name":"2019 IEEE International Conference on Cloud Engineering (IC2E)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130683944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}