{"title":"Tarcil: reconciling scheduling speed and quality in large shared clusters","authors":"","doi":"10.1145/2806777.2806779","DOIUrl":"https://doi.org/10.1145/2806777.2806779","url":null,"abstract":"Scheduling diverse applications in large, shared clusters is particularly challenging. Recent research on cluster scheduling focuses either on scheduling speed, using sampling to quickly assign resources to tasks, or on scheduling quality, using centralized algorithms that search for the resources that improve both task performance and cluster utilization. We present Tarcil, a distributed scheduler that targets both scheduling speed and quality. Tarcil uses an analytically derived sampling framework that adjusts the sample size based on load, and provides statistical guarantees on the quality of allocated resources. It also implements admission control when sampling is unlikely to find suitable resources. This makes it appropriate for large, shared clusters hosting short- and long-running jobs. We evaluate Tarcil on clusters with hundreds of servers on EC2. For highly-loaded clusters running short jobs, Tarcil improves task execution time by 41% over a distributed, sampling-based scheduler. For more general scenarios, Tarcil achieves near-optimal performance for 4× and 2× more jobs than sampling-based and centralized schedulers respectively.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124217107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving cost-efficient, data-intensive computing in the cloud","authors":"Michael Conley, Amin Vahdat, G. Porter","doi":"10.1145/2806777.2806781","DOIUrl":"https://doi.org/10.1145/2806777.2806781","url":null,"abstract":"Cloud computing providers have recently begun to offer high-performance virtualized flash storage and virtualized network I/O capabilities, which have the potential to increase application performance. Since users pay for only the resources they use, these new resources have the potential to lower overall cost. Yet achieving low cost requires choosing the right mixture of resources, which is only possible if their performance and scaling behavior is known. In this paper, we present a systematic measurement of recently introduced virtualized storage and network I/O within Amazon Web Services (AWS). Our experience shows that there are scaling limitations in clusters relying on these new features. As a result, provisioning for a large-scale cluster differs substantially from small-scale deployments. We describe the implications of this observation for achieving efficiency in large-scale cloud deployments. To confirm the value of our methodology, we deploy cost-efficient, high-performance sorting of 100 TB as a large-scale evaluation.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131887773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vFair: latency-aware fair storage scheduling via per-IO cost-based differentiation","authors":"Hui Lu, Brendan Saltaformaggio, R. Kompella, Dongyan Xu","doi":"10.1145/2806777.2806943","DOIUrl":"https://doi.org/10.1145/2806777.2806943","url":null,"abstract":"In virtualized data centers, multiple VMs are consolidated to access a shared storage system. Effective storage resource management, however, turns out to be challenging, as VM workloads exhibit various IO patterns and diverse loads. To multiplex the underlying hardware resources among VMs, providing fairness and isolation while maintaining high resource utilization becomes imperative for effective storage resource management. Existing schedulers such as Linux CFQ or SFQ can provide some fairness, but it has been observed that synchronous IO tends to lose fair shares significantly when competing with aggressive VMs. In this paper, we introduce vFair, a novel scheduling framework that achieves IO resource sharing fairness among VMs, regardless of their IO patterns and workloads. The design of vFair takes per-IO cost into consideration and strikes a balance between fairness and storage resource utilization. We have developed a Xen-based prototype of vFair and evaluated it with a wide range of storage workloads. Our results from both micro-benchmarks and real-world applications demonstrate the effectiveness of vFair, with significantly improved fairness and high resource utilization.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121300377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using data transformations for low-latency time series analysis","authors":"Henggang Cui, K. Keeton, Indrajit Roy, K. Viswanathan, G. Ganger","doi":"10.1145/2806777.2806839","DOIUrl":"https://doi.org/10.1145/2806777.2806839","url":null,"abstract":"Time series analysis is commonly used when monitoring data centers, networks, weather, and even human patients. In most cases, the raw time series data is massive, from millions to billions of data points, and yet interactive analyses require low (e.g., sub-second) latency. Aperture transforms raw time series data, during ingest, into compact summarized representations that it can use to efficiently answer queries at runtime. Aperture handles a range of complex queries, from correlating hundreds of lengthy time series to predicting anomalies in the data. Aperture achieves much of its high performance by executing queries on data summaries, while providing a bound on the information lost when transforming data. By doing so, Aperture can reduce query latency as well as the data that needs to be stored and analyzed to answer a query. Our experiments on real data show that Aperture can provide one to four orders of magnitude lower query response time, while incurring only 10% ingest time overhead and less than 20% error in accuracy.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125617357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing replication bandwidth for distributed document databases","authors":"Lianghong Xu, Andrew Pavlo, S. Sengupta, Jin Li, G. Ganger","doi":"10.1145/2806777.2806840","DOIUrl":"https://doi.org/10.1145/2806777.2806840","url":null,"abstract":"With the rise of large-scale, Web-based applications, users are increasingly adopting a new class of document-oriented database management systems (DBMSs) that allow for rapid prototyping while also achieving scalable performance. Like for other distributed storage systems, replication is important for document DBMSs in order to guarantee availability. The network bandwidth required to keep replicas synchronized is expensive and is often a performance bottleneck. As such, there is a strong need to reduce the replication bandwidth, especially for geo-replication scenarios where wide-area network (WAN) bandwidth is limited. This paper presents a deduplication system called sDedup that reduces the amount of data transferred over the network for replicated document DBMSs. sDedup uses similarity-based deduplication to remove redundancy in replication data by delta encoding against similar documents selected from the entire database. It exploits key characteristics of document-oriented workloads, including small item sizes, temporal locality, and the incremental nature of document edits. Our experimental evaluation of sDedup with three real-world datasets shows that it is able to achieve up to 38X reduction in data sent over the network, significantly outperforming traditional chunk-based deduplication techniques while incurring negligible performance overhead.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123467641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraM: scaling graph computation to the trillions","authors":"Ming Wu, Fan Yang, Jilong Xue, Wencong Xiao, Youshan Miao, Lan Wei, Haoxiang Lin, Yafei Dai, Lidong Zhou","doi":"10.1145/2806777.2806849","DOIUrl":"https://doi.org/10.1145/2806777.2806849","url":null,"abstract":"GraM is an efficient and scalable graph engine for a large class of widely used graph algorithms. It is designed to scale up to multicores on a single server, as well as scale out to multiple servers in a cluster, offering significant, often over an order-of-magnitude, improvement over existing distributed graph engines on evaluated graph algorithms. GraM is also capable of processing graphs that are significantly larger than previously reported. In particular, using 64 servers (1,024 physical cores), it performs a PageRank iteration in 140 seconds on a synthetic graph with over one trillion edges, setting a new milestone for graph engines. GraM's efficiency and scalability comes from a judicious architectural design that exploits the benefits of multi-core and RDMA. GraM uses a simple message-passing based scaling architecture for both scaling up and scaling out to expose inherent parallelism. It further benefits from a specially designed multi-core aware RDMA-based communication stack that preserves parallelism in a balanced way and allows overlapping of communication and computation. A high degree of parallelism often comes at the cost of lower efficiency due to resource fragmentation. GraM is equipped with an adaptive mechanism that evaluates the cost and benefit of parallelism to decide the appropriate configuration. Combined, these mechanisms allow GraM to scale up and out with high efficiency.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128372829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ShardFS vs. IndexFS: replication vs. caching strategies for distributed metadata management in cloud storage systems","authors":"Lin Xiao, Kai Ren, Qing Zheng, Garth A. Gibson","doi":"10.1145/2806777.2806844","DOIUrl":"https://doi.org/10.1145/2806777.2806844","url":null,"abstract":"The rapid growth of cloud storage systems calls for fast and scalable namespace processing. While few commercial file systems offer anything better than federating individually non-scalable namespace servers, a recent academic file system, IndexFS, demonstrates scalable namespace processing based on client caching of directory entries and permissions (directory lookup state) with no per-client state in servers. In this paper we explore explicit replication of directory lookup state in all servers as an alternative to caching this information in all clients. Both eliminate most repeated RPCs to different servers in order to resolve hierarchical permission tests. Our realization for server replicated directory lookup state, ShardFS, employs a novel file system specific hybrid optimistic and pessimistic concurrency control favoring single object transactions over distributed transactions. Our experimentation suggests that if directory lookup state mutation is a fixed fraction of operations (strong scaling for metadata), server replication does not scale as well as client caching, but if directory lookup state mutation is proportional to the number of jobs, not the number of processes per job, (weak scaling for metadata), then server replication can scale more linearly than client caching and provide lower 70 percentile response times as well.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121659471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CoolProvision: underprovisioning datacenter cooling","authors":"I. Manousakis, Íñigo Goiri, S. Sankar, Thu D. Nguyen, R. Bianchini","doi":"10.1145/2806777.2806938","DOIUrl":"https://doi.org/10.1145/2806777.2806938","url":null,"abstract":"Cloud providers have made significant strides in reducing the cooling capital and operational costs of their datacenters, for example, by leveraging outside air (\"free\") cooling where possible. Despite these advances, cooling costs still represent a significant expense mainly because cloud providers typically provision their cooling infrastructure for the worst-case scenario (i.e., very high load and outside temperature at the same time). Thus, in this paper, we propose to reduce cooling costs by underprovisioning the cooling infrastructure. When the cooling is underprovisioned, there might be (rare) periods when the cooling infrastructure cannot cool down the IT equipment enough. During these periods, we can either (1) reduce the processing capacity and potentially degrade the quality of service, or (2) let the IT equipment temperature increase in exchange for a controlled degradation in reliability. To determine the ideal amount of underprovisioning, we introduce CoolProvision, an optimization and simulation framework for selecting the cheapest provisioning within performance constraints defined by the provider. CoolProvision leverages an abstract trace of the expected workload, as well as cooling, performance, power, reliability, and cost models to explore the space of potential provisionings. Using data from a real small free-cooled datacenter, our results suggest that CoolProvision can reduce the cost of cooling by up to 55%. We extrapolate our experience and results to larger cloud datacenters as well.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"227 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131184548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kubernetes and the path to cloud native","authors":"E. Brewer","doi":"10.1145/2806777.2809955","DOIUrl":"https://doi.org/10.1145/2806777.2809955","url":null,"abstract":"We are in the midst of an important shift to higher levels of abstraction than virtual machines. Kubernetes aims to simplify the deployment and management of services, including the construction of applications as sets of interacting but independent services. We explain some of the key concepts in Kubernetes and show how they work together to simplify evolution and scaling.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131054716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Sixth ACM Symposium on Cloud Computing","authors":"Shahram Ghandeharizadeh, M. Balazinska, M. Freedman, Sumita Barahmand","doi":"10.1145/2806777","DOIUrl":"https://doi.org/10.1145/2806777","url":null,"abstract":"The stated scope of SoCC is to be broad and encompass diverse data management and systems topics, and this year's 34 accepted papers are no exception. They touch on a wide range of data systems topics including new architectures, scheduling, performance modeling, high availability, replication, elasticity, migration, costs and performance trade-offs, complex analysis, and testing. The conference also includes 2 poster sessions (with 30 posters in addition to invited poster presentations for the accepted papers), keynotes by Eric Brewer of Google/UC Berkeley and Samuel Madden of MIT, and a social program that includes a banquet and a luncheon for students and senior systems and database researchers. The symposium is co-located with the 41st International Conference on Very Large Databases, VLDB 2015, highlighting the synergy between big data and the cloud.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115606035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}