Chien-Chun Hung, G. Ananthanarayanan, L. Golubchik, Minlan Yu, Mingyang Zhang
{"title":"Wide-area analytics with multiple resources","authors":"Chien-Chun Hung, G. Ananthanarayanan, L. Golubchik, Minlan Yu, Mingyang Zhang","doi":"10.1145/3190508.3190528","DOIUrl":"https://doi.org/10.1145/3190508.3190528","url":null,"abstract":"Running data-parallel jobs across geo-distributed sites has emerged as a promising direction due to the growing need for geo-distributed cluster deployment. A key difference between geo-distributed and intra-cluster jobs is the heterogeneous (and often constrained) nature of compute and network resources across the sites. We propose Tetrium, a system for multi-resource allocation in geo-distributed clusters, that jointly considers both compute and network resources for task placement and job scheduling. Tetrium significantly reduces job response time, while incorporating several other performance goals with simple control knobs. Our EC2 deployment and trace-driven simulations suggest that Tetrium improves the average job response time by up to 78% compared to existing data-locality-based solutions, and up to 55% compared to Iridium, the recently proposed geo-distributed analytics system.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125057997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iraklis Psaroudakis, Stefan Kaestle, Matthias Grimmer, D. Goodman, Jean-Pierre Lozi, T. Harris
{"title":"Analytics with smart arrays: adaptive and efficient language-independent data","authors":"Iraklis Psaroudakis, Stefan Kaestle, Matthias Grimmer, D. Goodman, Jean-Pierre Lozi, T. Harris","doi":"10.1145/3190508.3190514","DOIUrl":"https://doi.org/10.1145/3190508.3190514","url":null,"abstract":"This paper introduces smart arrays, an abstraction for providing adaptive and efficient language-independent data storage. Their smart functionalities include NUMA-aware data placement across sockets and bit compression. We show how our single C++ implementation can be used efficiently from both native C++ and compiled Java code. We experimentally evaluate smart arrays on a diverse set of C++ and Java analytics workloads. Further, we show how their smart functionalities affect performance and lead to differences in hardware resource demands on multicore machines, motivating the need for adaptivity. We observe that smart arrays can significantly decrease the memory space requirements of analytics workloads, and improve their performance by up to 4x. Smart arrays are the first step towards general smart collections with various smart functionalities that enable the consumption of hardware resources to be traded-off against one another.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127351146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laurent Bindschaedler, Jasmina Malicevic, Nicolas Schiper, Ashvin Goel, W. Zwaenepoel
{"title":"Rock you like a hurricane: taming skew in large scale analytics","authors":"Laurent Bindschaedler, Jasmina Malicevic, Nicolas Schiper, Ashvin Goel, W. Zwaenepoel","doi":"10.1145/3190508.3190532","DOIUrl":"https://doi.org/10.1145/3190508.3190532","url":null,"abstract":"Current cluster computing frameworks suffer from load imbalance and limited parallelism due to skewed data distributions, processing times, and machine speeds. We observe that the underlying cause for these issues in current systems is that they partition work statically. Hurricane is a high-performance large-scale data analytics system that successfully tames skew in novel ways. Hurricane performs adaptive work partitioning based on load observed by nodes at runtime. Overloaded nodes can spawn clones of their tasks at any point during their execution, with each clone processing a subset of the original data. This allows the system to adapt to load imbalance and dynamically adjust task parallelism to gracefully handle skew. We support this design by spreading data across all nodes and allowing nodes to retrieve data in a decentralized way. The result is that Hurricane automatically balances load across tasks, ensuring fast completion times. We evaluate Hurricane's performance on typical analytics workloads and show that it significantly outperforms state-of-the-art systems for both uniform and skewed datasets, because it ensures good CPU and storage utilization in all cases.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128815907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Changwoo Min, Woon-Hak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim
{"title":"Solros: a data-centric operating system architecture for heterogeneous computing","authors":"Changwoo Min, Woon-Hak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim","doi":"10.1145/3190508.3190523","DOIUrl":"https://doi.org/10.1145/3190508.3190523","url":null,"abstract":"We propose Solros---a new operating system architecture for heterogeneous systems that comprises fast host processors, slow but massively parallel co-processors, and fast I/O devices. A general consensus to fully drive such a hardware system is to have a tight integration among processors and I/O devices. Thus, in the Solros architecture, a co-processor OS (data-plane OS) delegates its services, specifically I/O stacks, to the host OS (control-plane OS). Our observation for such a design is that global coordination with system-wide knowledge (e.g., PCIe topology, a load of each co-processor) and the best use of heterogeneous processors is critical to achieving high performance. Hence, we fully harness these specialized processors by delegating complex I/O stacks on fast host processors, which leads to an efficient global coordination at the level of the control-plane OS. We developed Solros with Xeon Phi co-processors and implemented three core OS services: transport, file system, and network services. Our experimental results show significant performance improvement compared with the stock Xeon Phi running the Linux kernel. For example, Solros improves the throughput of file system and network operations by 19x and 7x, respectively. Moreover, it improves the performance of two realistic applications: 19x for text indexing and 2x for image search.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130538756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Park, Alexey Tumanov, Angela H. Jiang, M. Kozuch, G. Ganger
{"title":"3Sigma: distribution-based cluster scheduling for runtime uncertainty","authors":"J. Park, Alexey Tumanov, Angela H. Jiang, M. Kozuch, G. Ganger","doi":"10.1145/3190508.3190515","DOIUrl":"https://doi.org/10.1145/3190508.3190515","url":null,"abstract":"The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But, existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8--23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best effort jobs.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"8 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121003695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A frugal approach to reduce RCU grace period overhead","authors":"Aravinda Prasad, Kanchi Gopinath","doi":"10.1145/3190508.3190522","DOIUrl":"https://doi.org/10.1145/3190508.3190522","url":null,"abstract":"Grace period computation is a core part of the Read-Copy-Update (RCU) synchronization technique that determines the safe time to reclaim the deferred objects' memory. We first show that the eager grace period computation employed in the Linux kernel is appropriate only for enterprise workloads such as web and database servers where a large amount of reclaimable memory awaits the completion of a grace period. However, such memory is negligible in High-Performance Computing (HPC) and mostly idling environments due to limited OS kernel activity. Hence an eager approach is not only futile but also detrimental as the CPU cycles consumed to compute a grace period leads to jitter in HPC and frequent CPU wake-ups in idle environments. We design frugal grace periods, an economical grace period computation for non-enterprise environments that consume fewer CPU cycles. In addition, we reduce the number of grace periods either by using heuristics or by letting the memory allocator to explicitly request for a grace period only when it is running out of free objects. Our implementation in the Linux kernel reduces the number of grace periods by 68% to 99%, reduces the CPU time consumed by grace periods by 39% to 99%, improves the throughput by up to 28% for NAS parallel benchmarks and increases the CPU time spent in low power states by 2.4x when the system is idle.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128638268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flashabacus: a self-governing flash-based accelerator for low-power systems","authors":"Jie Zhang, Myoungsoo Jung","doi":"10.1145/3190508.3190544","DOIUrl":"https://doi.org/10.1145/3190508.3190544","url":null,"abstract":"Energy efficiency and computing flexibility are some of the primary design constraints of heterogeneous computing. In this paper, we present FlashAbacus, a data-processing accelerator that self-governs heterogeneous kernel executions and data storage accesses by integrating many flash modules in lightweight multiprocessors. The proposed accelerator can simultaneously process data from different applications with diverse types of operational functions, and it allows multiple kernels to directly access flash without the assistance of a host-level file system or an I/O runtime library. We prototype FlashAbacus on a multicore-based PCIe platform that connects to FPGA-based flash controllers with a 20 nm node process. The evaluation results show that FlashAbacus can improve the bandwidth of data processing by 127%, while reducing energy consumption by 78.4%, as compared to a conventional method of heterogeneous computing.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116672623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongzhi Chen, Miao Liu, Yunjian Zhao, Xiao Yan, Da Yan, James Cheng
{"title":"G-Miner: an efficient task-oriented graph mining system","authors":"Hongzhi Chen, Miao Liu, Yunjian Zhao, Xiao Yan, Da Yan, James Cheng","doi":"10.1145/3190508.3190545","DOIUrl":"https://doi.org/10.1145/3190508.3190545","url":null,"abstract":"Graph mining is one of the most important areas in data mining. However, scalable solutions for graph mining are still lacking as existing studies focus on sequential algorithms. While many distributed graph processing systems have been proposed in recent years, most of them were designed to parallelize computations such as PageRank and Breadth-First Search that keep states on individual vertices and propagate updates along edges. Graph mining, on the other hand, may generate many subgraphs whose number can far exceed the number of vertices. This inevitably leads to much higher computational and space complexity rendering existing graph systems inefficient. We propose G-Miner, a distributed system with a new architecture designed for general graph mining. G-Miner adopts a unified programming framework for implementing a wide range of graph mining algorithms. We model subgraph processing as independent tasks, and design a novel task pipeline to streamline task processing for better CPU, network and I/O utilization. Our extensive experiments validate the efficiency of G-Miner for a range of graph mining tasks.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125732738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mainak Ghosh, Ashwini Raina, Le Xu, Xiaoyao Qian, Indranil Gupta, H. Gupta
{"title":"Popular is cheaper: curtailing memory costs in interactive analytics engines","authors":"Mainak Ghosh, Ashwini Raina, Le Xu, Xiaoyao Qian, Indranil Gupta, H. Gupta","doi":"10.1145/3190508.3190542","DOIUrl":"https://doi.org/10.1145/3190508.3190542","url":null,"abstract":"This paper targets the growing area of interactive data analytics engines. We present a system called Getafix that intelligently decides replication levels and replica placement for data segments, in a way that is responsive to changing popularity of data access by incoming queries. We present an optimal solution to the static version of the problem, achieving minimality in both makespan and replication factor. Based on this intuition we build the Getafix system to handle queries and segments arriving in real time. We integrated Getafix into Druid, a modern open-source interactive data analytics engine. We present experimental results using workloads from Yahoo!'s production Druid cluster. Compared to existing work, Getafix achieves comparable query latency (both average and tail), while using 1.45--2.15 x less memory in a private cloud. In a public cloud, for a 100 TB hot dataset size, Getafix can cut dollar costs by as much as 10 million annually with negligible performance impact.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127567844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yiran Li, Da Wei, Xiaoqi Chen, Ziheng Song, Ruihan Wu, Yuxing Li, Xin Jin, W. Xu
{"title":"DumbNet: a smart data center network fabric with dumb switches","authors":"Yiran Li, Da Wei, Xiaoqi Chen, Ziheng Song, Ruihan Wu, Yuxing Li, Xin Jin, W. Xu","doi":"10.1145/3190508.3190531","DOIUrl":"https://doi.org/10.1145/3190508.3190531","url":null,"abstract":"Today's data center networks have already pushed many functions to hosts. A fundamental question is how to divide functions between network and software. We present DumbNet, a new data center network architecture with no state in switches. DumbNet switches have no forwarding tables, no state, and thus require no configurations. Almost all control plane functions are pushed to hosts: they determine the entire path of a packet and then write the path as tags in the packet header. Switches only need to examine the tags to forward packets and monitor the port state. We design a set of host-based mechanisms to make the new architecture viable, from network bootstrapping and topology maintenance to network routing and failure handling. We build a prototype with 7 switches and 27 servers, as well as an FPGA-based switch. Extensive evaluations show that DumbNet achieves performance comparable to traditional networks, supports application-specific extensions like flowlet-based traffic engineering, and stays extremely simple and easy-to-manage.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123197714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}