{"title":"On Max-min Fair Resource Allocation for Distributed Job Execution","authors":"Yitong Guan, Chuanyou Li, Xueyan Tang","doi":"10.1145/3337821.3337843","DOIUrl":"https://doi.org/10.1145/3337821.3337843","url":null,"abstract":"In modern data intensive computing, it is increasingly common for jobs to be executed in a distributed fashion across multiple machine clusters or datacenters to take advantage of data locality. This paper studies fair resource allocation among jobs requiring distributed execution. We extend conventional max-min fairness for resource allocation in a single machine or machine cluster to distributed job execution over multiple sites and define Aggregate Max-min Fairness (AMF) which requires the aggregate resource allocation across all sites to be max-min fair. We show that AMF satisfies the properties of Pareto efficiency, envy-freeness and strategy-proofness, but it does not necessarily satisfy the sharing incentive property. We propose an enhanced version of AMF to guarantee the sharing incentive property. We present algorithms to compute AMF allocations and propose an add-on to optimize the job completion times under AMF. Experimental results show that compared with a baseline which simply requires the resource allocation at each site to be max-min fair, AMF performs significantly better in balancing resource allocation and in job completion time, particularly when the workload distribution of jobs among sites is highly skewed.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117213089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Load Balancing in Hybrid Switching Data Center Networks with Converters","authors":"Jiaqi Zheng, Qiming Zheng, Xiaofeng Gao, Guihai Chen","doi":"10.1145/3337821.3337898","DOIUrl":"https://doi.org/10.1145/3337821.3337898","url":null,"abstract":"Today's data centers rely on scale-out architectures like fat-tree, BCube, VL2, etc. to connect a large number of commodity servers. It's important to balance the traffic load across the available links. Since the traditional electrical network cannot perfectly respond to the traffic variations in data centers, a growing trend is to introduce converters with adjustable optical links instead of adding more wiring links. However, little is known today about how to fully exploit the potential of the flexibility from the converters: the joint optimization on adjusting the optical links inside the converters and the routing in the whole network remains algorithmically challenging. In this paper, we initiate the study of dynamic load balancing problem (DLBP) in hybrid switching data center networks with converters. We design a set of specific converters for Diamond, VL2, BCube topologies to introduce more flexibility. Based on it, the connections of the optical links inside the converter and the route for each flow needs to be jointly optimized to minimize the maximum link utilization in the whole network. We formulate DLBP as an optimization program and prove that it's not only NP-hard, but also ρ-inapproximation. Further, we design a greedy algorithm to solve it. Extensive experiments show that our algorithm can reduce the traffic congestion by 12% on average.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127149837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs","authors":"Marco D'Amico, Ana Jokanovic, J. Corbalán","doi":"10.1145/3337821.3337909","DOIUrl":"https://doi.org/10.1145/3337821.3337909","url":null,"abstract":"In job scheduling, the concept of malleability has been explored since many years ago. Research shows that malleability improves system performance, but its utilization in HPC never became widespread. The causes are the difficulty in developing malleable applications, and the lack of support and integration of the different layers of the HPC software stack. However, in the last years, malleability in job scheduling is becoming more critical because of the increasing complexity of hardware and workloads. In this context, using nodes in an exclusive mode is not always the most efficient solution as in traditional HPC jobs, where applications were highly tuned for static allocations, but offering zero flexibility to dynamic executions. This paper proposes a new holistic, dynamic job scheduling policy, Slowdown Driven (SD-Policy), which exploits the malleability of applications as the key technology to reduce the average slowdown and response time of jobs. SD-Policy is based on backfill and node sharing. It applies malleability to running jobs to make room for jobs that will run with a reduced set of resources, only when the estimated slowdown improves over the static approach. We implemented SD-Policy in SLURM and evaluated it in a real production environment, and with a simulator using workloads of up to 198K jobs. Results show better resource utilization with the reduction of makespan, response time, slowdown, and energy consumption, up to respectively 7%, 50%, 70%, and 6%, for the evaluated workloads.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125554291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Execution of Parallel Loops via User-Defined Scheduling Policies","authors":"Seonmyeong Bak, Yanfei Guo, P. Balaji, Vivek Sarkar","doi":"10.1145/3337821.3337913","DOIUrl":"https://doi.org/10.1145/3337821.3337913","url":null,"abstract":"On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129790209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massively Parallel Automated Software Tuning","authors":"J. Kurzak, Y. Tsai, M. Gates, A. Abdelfattah, J. Dongarra","doi":"10.1145/3337821.3337908","DOIUrl":"https://doi.org/10.1145/3337821.3337908","url":null,"abstract":"This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128630304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stage Delay Scheduling: Speeding up DAG-style Data Analytics Jobs with Resource Interleaving","authors":"Wujie Shao, Fei Xu, Li Chen, Haoyue Zheng, Fangming Liu","doi":"10.1145/3337821.3337872","DOIUrl":"https://doi.org/10.1145/3337821.3337872","url":null,"abstract":"To increase the resource utilization of datacenters, big data analytics jobs are commonly running stages in parallel which are organized into and scheduled according to the Directed Acyclic Graph (DAG). Through an in-depth analysis of the latest Alibaba cluster trace and our motivation experiments on Amazon EC2, however, we show that the CPU and network resources are still under-utilized due to the unwise stage scheduling, thereby prolonging the completion time of a DAG-style job (e.g., Spark). While existing works on reducing the job completion time focus on either task scheduling or job scheduling, stage scheduling has received comparably little attention. In this paper, we design and implement DelayStage, a simple yet effective stage delay scheduling strategy to interleave the cluster resources across the parallel stages, so as to increase the cluster resource utilization and speed up the job performance. With the aim of minimizing the makespan of parallel stages, DelayStage judiciously arranges the execution of stages in a pipelined manner to maximize the performance benefits of resource interleaving. Extensive prototype experiments on 30 Amazon EC2 instances and complementary trace-driven simulations show that DelayStage can improve the cluster resource utilization by up to 81.8% and reduce the job completion time by up to 41.3%, in comparison to the stock Spark and the state-of-the-art stage scheduling strategies, yet with acceptable runtime overhead.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131178369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TLB","authors":"Jinbin Hu, Jiawei Huang, Wenjun Lv, Weihe Li, Jianxin Wang, Tian He","doi":"10.1145/3337821.3337866","DOIUrl":"https://doi.org/10.1145/3337821.3337866","url":null,"abstract":"Modern datacenter topologies typically are multi-rooted trees consisting of multiple paths between any given pair of hosts. Recent load balancing designs focus on making full use of available parallel paths to provide high bisection bandwidth. However, they are agnostic to the mixed traffic generated by diverse applications in data centers and respectively use the same granularity in rerouting flows regardless of the flow type. Therefore, the short flows suffer the long-tailed queueing delay and reordering problems, while the throughputs of long flows are also degraded dramatically due to low link utilization and packet reordering under the non-adaptive granularity. To solve these problems, we design a traffic-aware load balancing (TLB) scheme to adopt different rerouting granularities for two kinds of flows. Specifically, TLB adaptively adjusts the switching granularity of long flows according to the load strength of short ones. Under the heavy load of short flows, the long flows use large switching granularity to help short ones obtain more opportunities in choosing short queues to complete quickly. When the load strength of short flows is low, the long flows switch paths more flexibly with small switching granularity to achieve high throughput. TLB is deployed at the switch, without any modifications on the end-hosts. The experimental results of NS2 simulations and Mininet implementation show that TLB significantly reduces the average flow completion time (AFCT) of short flows by ~15%-40% over the state-of-the-art load balancing schemes and achieves the high throughput for long flows.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124353888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network Congestion-aware Online Service Function Chain Placement and Load Balancing","authors":"Xiaojun Shang, Zhenhua Liu, Yuanyuan Yang","doi":"10.1145/3337821.3337850","DOIUrl":"https://doi.org/10.1145/3337821.3337850","url":null,"abstract":"Emerging virtual network functions (VNFs) introduce new flexibility and scalability into traditional middlebox. Specifically, middleboxes are virtualized as software-based platforms running on commodity servers known as network points of presence (N-PoPs). Traditional network services are therefore realized by chained VNFs, i.e., service function chains (SFCs), running on potentially multiple N-PoPs. SFCs can be flexibly placed and routed to reduce operating cost. However, excessively pursuing low cost may incur congestion on some popular N-PoPs and links, which results in performance degradation or even violation of the service level of agreements. In this paper, we first propose an optimization problem for joint SFC placement and routing. Given the problem is NP-hard, we design an approximation algorithm named candidate path selection (CPS) with a theoretical performance guarantee. We then propose an online optimization problem for placement of SFCs with fast demand fluctuation. The problem concerns migration costs of VNFs between time slots, and we design an online candidate path selection (OCPS) algorithm to handle it. Extensive simulation results highlight that the CPS and OCPS algorithms provide efficient placement and routing of SFCs comparable to the optimal solution.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130524842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COMBFT","authors":"Yingyao Rong, Weigang Wu, Zhiguang Chen","doi":"10.1145/3337821.3337885","DOIUrl":"https://doi.org/10.1145/3337821.3337885","url":null,"abstract":"Byzantine Fault-Tolerant (BFT) state machine replication protocol is an important building block for highly available distributed computing. This paper presents COMBFT, a BFT protocol that achieves both efficiency and robustness simultaneously. The major novelty of COMBFT lies in Conflicting-Order-Match (COM), a new request ordering mechanism that uses a new way to select the available sequence number for requests, and detects the possible malicious primary early. COM assigns sequence number based on request interference, and requires both primary and backup nodes to conduct request ordering, which can greatly reduce the impact of malicious primary and clients. When the backup suspects the primary may be malicious, it triggers an efficient commit protocol with two phases (i.e., suspect phase and commit phase) to further confirm whether the primary is malicious, and commit the request. The performance of COMBFT is evaluated via simulations and the results illustrate the outstanding performance of COMBFT in terms of throughput, latency and fault scalability.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123197193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Artemis","authors":"Xuebing Li, Bingyang Liu, Yang Chen, Yu Xiao, Jiaxin Tang, Xin Wang","doi":"10.1145/3337821.3337897","DOIUrl":"https://doi.org/10.1145/3337821.3337897","url":null,"abstract":"Today, Internet service deployment is typically implemented with server replication at multiple locations for the purpose of load balancing, failure tolerance, and user experience optimization. Domain name system (DNS) is responsible for translating human-readable domain names into network-routable IP addresses. When multiple replicas exist, upon the arrival of a query, DNS selects one replica and responds with its IP address. Thus, the delay caused by the process of DNS query including the selection of replica is part of the connection setup latency. In this paper, we proposed Artemis, a practical low-latency naming and routing system that aims at reducing the connection setup latency by eliminating the DNS query latency while keeping the ability to perform optimal server (replica) selection based on user-defined rules. Artemis achieves these goals by integrating name resolution into the transport layer handshake. Artemis allows clients to calculate locally the IP address of a Service Dispatcher, which serves as a proxy of hosting servers. Service Dispatchers forward the handshake request from a client to a server, and the response is embedded with the server's IP address back to the client. This enables clients to connect directly with servers afterward without querying DNS servers, and therefore eliminates the DNS query latency. Meanwhile, Artemis supports user-defined replica selection policies. We have implemented Artemis and evaluated its performance using the PlanetLab testbed and RIPE Atlas probes. Our results show that Artemis reduces the connection setup latency by 26.2% on average compared with the state-of-the-art.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"856 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114138079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}