Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing: Latest Publications

Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters
Piyush Sao, Hao Lu, R. Kannan, Vijay Thakkar, R. Vuduc, T. Potok
{"title":"Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters","authors":"Piyush Sao, Hao Lu, R. Kannan, Vijay Thakkar, R. Vuduc, T. Potok","doi":"10.1145/3431379.3460651","DOIUrl":"https://doi.org/10.1145/3431379.3460651","url":null,"abstract":"We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matrix-multiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPI-collective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU on-demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256~nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128605199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
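The blocked structure that makes Floyd-Warshall resemble matrix multiplication is the basis of the paper's GPU formulation. As a reference point, here is a minimal single-node sketch of the classic three-phase blocked Floyd-Warshall in NumPy; it shows the min-plus update the paper parallelizes, not the paper's distributed, pipelined, GPU-offload implementation.

```python
import numpy as np

def fw_block_update(C, A, B):
    """Min-plus 'matrix multiply' update: C[i,j] = min(C[i,j], min_k A[i,k] + B[k,j])."""
    for k in range(A.shape[1]):
        np.minimum(C, A[:, k, None] + B[None, k, :], out=C)

def blocked_floyd_warshall(D, b=64):
    """In-place blocked APSP on a dense n x n distance matrix D (np.inf = no edge)."""
    n = D.shape[0]
    nb = (n + b - 1) // b

    def blk(i, j):  # view of block (i, j)
        return D[i*b:min((i+1)*b, n), j*b:min((j+1)*b, n)]

    for k in range(nb):
        # Phase 1: the diagonal block depends only on itself.
        fw_block_update(blk(k, k), blk(k, k), blk(k, k))
        # Phase 2: row and column panels depend on the diagonal block.
        for j in range(nb):
            if j != k:
                fw_block_update(blk(k, j), blk(k, k), blk(k, j))
                fw_block_update(blk(j, k), blk(j, k), blk(k, k))
        # Phase 3: remaining blocks update like a (min-plus) matrix multiply.
        for i in range(nb):
            for j in range(nb):
                if i != k and j != k:
                    fw_block_update(blk(i, j), blk(i, k), blk(k, j))
    return D
```

Phase 3 dominates the work and has the same data-access pattern as GEMM, which is why the approach maps well to GPUs and to communication-avoiding distribution schemes.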
Machine Learning Augmented Hybrid Memory Management
Thaleia Dimitra Doudali, Ada Gavrilovska
{"title":"Machine Learning Augmented Hybrid Memory Management","authors":"Thaleia Dimitra Doudali, Ada Gavrilovska","doi":"10.1145/3431379.3464450","DOIUrl":"https://doi.org/10.1145/3431379.3464450","url":null,"abstract":"The integration of emerging non volatile memory hardware technologies into the main memory substrate, enables massive memory capacities at a reasonable cost in return for slower access speeds. This heterogeneity, along with the greater irregularity in the behavior of emerging workloads, render existing memory management approaches ineffective. This creates a significant gap between the realized vs. achievable performance and efficiency. At the same time, resource management solutions augmented with machine learning show great promise for fine-tuning system configuration knobs and predicting future behaviors. This thesis builds novel system-level mechanisms and reveals new insights towards the practical integration of machine learning in hybrid memory management. The specific contributions of this thesis is a machine learning augmented memory manager, coupled with insightful mechanisms to reduce the associated learning overheads and fine-tune critical operational parameters. The impact of this thesis is realizing an average of 3x application performance improvements and setting the new state-of-the-art in hybrid memory management.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125951057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
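The abstract does not spell out the thesis's predictors or placement mechanisms, so the following is a purely hypothetical sketch of the general shape of learning-guided hybrid memory management: predict per-page access counts from recent history (here with a simple EWMA, an assumption, not the thesis's model) and keep the hottest predicted pages in the fast tier.

```python
def ewma_predict(history, alpha=0.5):
    """Predict next-interval access count for a page from its past counts (assumed EWMA)."""
    pred = 0.0
    for count in history:
        pred = alpha * count + (1 - alpha) * pred
    return pred

def plan_placement(page_histories, fast_capacity_pages):
    """Return the set of page ids to keep in the fast (DRAM) tier next interval.

    page_histories: dict page_id -> list of per-interval access counts.
    Pages not selected stay in the slower (NVM) tier.
    """
    ranked = sorted(page_histories,
                    key=lambda p: ewma_predict(page_histories[p]),
                    reverse=True)
    return set(ranked[:fast_capacity_pages])

# Hottest predicted pages go to DRAM; the cold page stays in NVM.
print(plan_placement({1: [9, 12, 10], 2: [0, 1, 0], 3: [5, 4, 6]},
                     fast_capacity_pages=2))
```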
CharminG
Jaemin Choi, D. Richards, L. Kalé
{"title":"CharminG","authors":"Jaemin Choi, D. Richards, L. Kalé","doi":"10.1145/3431379.3464454","DOIUrl":"https://doi.org/10.1145/3431379.3464454","url":null,"abstract":"Host-driven execution of applications on modern GPU-accelerated systems suffer from frequent host-device synchronizations, data movement and limited flexibility in scheduling user tasks. We present CharminG, a runtime system designed to run entirely on the GPU without any interaction with the host. CharminG takes inspiration from the Charm++ parallel programming system and implements processor virtualization and message-driven execution on the GPU. We evaluate the composability and preliminary performance of CharminG with a proxy application that performs the Jacobi iterative method in a two-dimensional grid, using the Lassen supercomputer at Lawrence Livermore National Laboratory.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124463717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
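CharminG's scheduler runs on the GPU itself, which is not reproduced here; the host-side Python sketch below only illustrates the Charm++-style execution model it adopts: virtualized processors (chares) whose work is driven entirely by messages pulled from a queue. All class and method names are illustrative, not CharminG's API.

```python
from collections import deque

class Chare:
    """A virtual processor: private state plus message handlers, as in Charm++."""
    def __init__(self, name):
        self.name = name
        self.total = 0

    def on_add(self, sched, value):
        self.total += value
        if self.total < 10:
            # Further work happens only by sending more messages.
            sched.send(self.name, "on_add", self.total)

class Scheduler:
    """Message-driven execution: pop a message, run its handler, repeat."""
    def __init__(self):
        self.chares, self.queue = {}, deque()

    def spawn(self, name):
        self.chares[name] = Chare(name)

    def send(self, target, method, *args):
        self.queue.append((target, method, args))

    def run(self):
        while self.queue:
            target, method, args = self.queue.popleft()
            getattr(self.chares[target], method)(self, *args)

sched = Scheduler()
sched.spawn("a")
sched.send("a", "on_add", 1)
sched.run()  # messages drive all execution; no host-driven synchronization
```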
DLion
Rankyung Hong, A. Chandra
{"title":"DLion","authors":"Rankyung Hong, A. Chandra","doi":"10.1145/3431379.3460643","DOIUrl":"https://doi.org/10.1145/3431379.3460643","url":null,"abstract":"Deep learning (DL) is a popular technique for building models from large quantities of data such as pictures, videos, messages generated from edges devices at rapid pace all over the world. It is often infeasible to migrate large quantities of data from the edges to centralized data center(s) over WANs for training due to privacy, cost, and performance reasons. At the same time, training large DL models on edge devices is infeasible due to their limited resources. An attractive alternative for DL training distributed data is to use micro-clouds---small-scale clouds deployed near edge devices in multiple locations. However, micro-clouds present the challenges of both computation and network resource heterogeneity as well as dynamism. In this paper, we introduce DLion, a new and generic decentralized distributed DL system designed to address the key challenges in micro-cloud environments, in order to reduce overall training time and improve model accuracy. We present three key techniques in DLion: (1) Weighted dynamic batching to maximize data parallelism for dealing with heterogeneous and dynamic compute capacity, (2) Per-link prioritized gradient exchange to reduce communication overhead for model updates based on available network capacity, and (3) Direct knowledge transfer to improve model accuracy by merging the best performing model parameters. We build a prototype of DLion on top of TensorFlow and show that DLion achieves up to 4.2X speedup in an Amazon GPU cluster, and up to 2X speed up and 26% higher model accuracy in a CPU cluster over four state-of-the-art distributed DL systems.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115900974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
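A minimal sketch of the idea behind weighted dynamic batching: size each worker's batch in proportion to its recently measured throughput, so heterogeneous workers finish an iteration at roughly the same time. The function name and interface are illustrative, not DLion's API.

```python
def weighted_batch_sizes(global_batch, throughputs):
    """Split a global batch across workers in proportion to samples/sec.

    throughputs: list of recent per-worker training rates (samples/sec),
    re-measured each interval so the split adapts to dynamic capacity.
    """
    total = sum(throughputs)
    sizes = [max(1, round(global_batch * t / total)) for t in throughputs]
    # Fix rounding drift so the sizes sum exactly to the global batch.
    sizes[sizes.index(max(sizes))] += global_batch - sum(sizes)
    return sizes

# A fast GPU worker gets a larger slice than two slower CPU workers.
print(weighted_batch_sizes(256, [900.0, 150.0, 150.0]))  # -> [192, 32, 32]
```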
Jigsaw
Staci A. Smith, D. Lowenthal
{"title":"Jigsaw","authors":"Staci A. Smith, D. Lowenthal","doi":"10.4135/9781452232324.n9","DOIUrl":"https://doi.org/10.4135/9781452232324.n9","url":null,"abstract":"","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114601335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LaSS: Running Latency Sensitive Serverless Computations at the Edge
Bin Wang, A. Ali-Eldin, P. Shenoy
{"title":"LaSS: Running Latency Sensitive Serverless Computations at the Edge","authors":"Bin Wang, A. Ali-Eldin, P. Shenoy","doi":"10.1145/3431379.3460646","DOIUrl":"https://doi.org/10.1145/3431379.3460646","url":null,"abstract":"Serverless computing has emerged as a new paradigm for running short-lived computations in the cloud. Due to its ability to handle IoT workloads, there has been considerable interest in running serverless functions at the edge. However, the constrained nature of the edge and the latency sensitive nature of workloads result in many challenges for serverless platforms. In this paper, we present LaSS, a platform that uses model-driven approaches for running latency-sensitive serverless computations on edge resources. LaSS uses principled queuing-based methods to determine an appropriate allocation for each hosted function and auto-scales the allocated resources in response to workload dynamics. LaSS uses a fair-share allocation approach to guarantee a minimum of allocated resources to each function in the presence of overload. In addition, it utilizes resource reclamation methods based on container deflation and termination to reassign resources from over-provisioned functions to under-provisioned ones. We implement a prototype of our approach on an OpenWhisk serverless edge cluster and conduct a detailed experimental evaluation. Our results show that LaSS can accurately predict the resources needed for serverless functions in the presence of highly dynamic workloads, and reprovision container capacity within hundreds of milliseconds while maintaining fair share allocation guarantees.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127615390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
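The abstract does not name the queuing model, so the sketch below assumes an M/M/c model per function: given a measured request arrival rate and per-container service rate, it finds the smallest container count whose mean response time (Erlang-C waiting time plus service time) meets the latency target. This is one plausible instance of a "principled queuing-based method," not LaSS's actual formula.

```python
import math

def erlang_c(c, a):
    """Probability an arrival must wait in an M/M/c queue with offered load a = lam/mu."""
    top = (a ** c / math.factorial(c)) * (c / (c - a))
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def containers_needed(lam, mu, latency_slo):
    """Smallest container count c with mean response time 1/mu + Wq <= latency_slo.

    lam: request arrival rate (req/s); mu: per-container service rate (req/s).
    """
    c = max(1, math.ceil(lam / mu))
    while True:
        if c * mu > lam:  # queue must be stable before Erlang-C applies
            wq = erlang_c(c, lam / mu) / (c * mu - lam)  # mean queueing delay
            if 1.0 / mu + wq <= latency_slo:
                return c
        c += 1

# 50 req/s, 100 ms average service time, 150 ms response-time target -> 7 containers.
print(containers_needed(lam=50.0, mu=10.0, latency_slo=0.150))
```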
Parallel Program Scaling Analysis using Hardware Counters
Shobhit Jagga, Preeti Malakar
{"title":"Parallel Program Scaling Analysis using Hardware Counters","authors":"Shobhit Jagga, Preeti Malakar","doi":"10.1145/3431379.3464453","DOIUrl":"https://doi.org/10.1145/3431379.3464453","url":null,"abstract":"We present a lightweight library that automatically collects several hardware counters for MPI applications. We analyze the effect of strong and weak scaling on the counters. We first correlate the counter values obtained from each process count, and then cluster the counters to identify counters that are affected similarly due to scaling. We noted that the effect of last-level cache misses is more pronounced for some applications such as miniFE.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122212077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
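A sketch of the analysis step described above: treat each counter as a vector of values across process counts, correlate the vectors, and greedily group counters that scale alike. The 0.95 threshold, the greedy grouping, and the sample counter values are assumptions for illustration, not the paper's method or data.

```python
import numpy as np

def cluster_counters(counters, threshold=0.95):
    """Greedily group counters whose values move together as process count grows.

    counters: dict name -> array of values, one entry per process count
    (e.g. measured at 2, 4, ..., 64 MPI ranks).
    """
    names = list(counters)
    data = np.array([counters[n] for n in names], dtype=float)
    corr = np.corrcoef(data)  # pairwise correlation across scaling points
    clusters = []
    for i, name in enumerate(names):
        for cl in clusters:
            # Join a cluster only if correlated with every member.
            if all(corr[i, names.index(m)] >= threshold for m in cl):
                cl.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Cache misses and cycles shrink together under strong scaling; FLOP count stays flat.
print(cluster_counters({
    "PAPI_L3_TCM":  np.array([8.0, 5.0, 3.5, 3.0]),
    "PAPI_TOT_CYC": np.array([80.0, 47.0, 33.0, 28.0]),
    "PAPI_FP_OPS":  np.array([10.0, 10.1, 9.9, 10.0]),
}))
```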
Using Pilot Jobs and CernVM File System for Simplified Use of Containers and Software Distribution
N. Urs, M. Mambelli, D. Dykstra
{"title":"Using Pilot Jobs and CernVM File System for Simplified Use of Containers and Software Distribution","authors":"N. Urs, M. Mambelli, D. Dykstra","doi":"10.1145/3431379.3464451","DOIUrl":"https://doi.org/10.1145/3431379.3464451","url":null,"abstract":"High Energy Physics (HEP) experiments entail an abundance of computing resources, i.e. sites, to run simulations and analyses by processing data. This requirement is fulfilled by local batch farms, grid sites, private/commercial clouds, and supercomputing centers via High Throughput Computing (HTC). The growing needs of such experiments and resources being prone to trends of heterogeneity make it difficult for physicists to handle these resources directly. Additionally, HEP collaborations heavily rely on data and software releases, typically in the order of tens of gigabytes, while conducting simulations and analyses. Hence, aspects of scalability, reliability, and maintenance become crucial with regards to the distribution of the necessary data and software stack. The GlideinWMS [4] framework helps with the resource management problem by using pilot jobs, aka Glideins, to provision reliable elastic virtual clusters. Glideins are submitted to unreliable heterogeneous resources which are validated and customized by the Glideins to make the worker nodes available for end-user job execution. On the other hand, the CernVM File System (CernVM-FS or CVMFS) [1] helps with data distribution. It is a write-once, read-everywhere filesystem used to deploy scientific software to thousands of nodes on a worldwide distributed computing infrastructure. CVMFS is based on the Hyper Text Transfer Protocol and has been widely used within the particle physics community for (1) distributing experiment software and data such as calibrations, and (2) facilitating containerization by efficiently hosting container images along with providing containerization software, especially Singularity [3] GlideinWMS relies on CVMFS installed locally on the computing resources to satisfy the experiments' software needs. This requires system administrators' effort to install and maintain CVMFS at the sites and limits the use of sites, especially HPC resources, that do not have CVMFS installed. This poster presents a solution, taking advantage of Glideins to provide CVMFS at most sites without the need for a local installation. Doing so expands the pool of resources available for HEP experiments and reduces the effort of system administrators for current resources. Additionally, the proposed solution allows GlideinWMS to also start Singularity [3], a containerization software that can run unprivileged, on sites where neither CVMFS nor Singularity are available, including HPC sites. 
The benefits provided by this solution are: (1) lower overhead for site administrators in that they have less software to install, (2) an expanded pool of resources that run user jobs with easy access to software and data provided by CVMFS, thus making life easier for the scientists, and (3) improved flexibility to use HPC resources by enabling GlideinWMS pilot jobs to support HPC sites.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127471858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AITurbo
Laiping Zhao, Fangshu Li, W. Qu, Kunlin Zhan, Qingman Zhang
{"title":"AITurbo","authors":"Laiping Zhao, Fangshu Li, W. Qu, Kunlin Zhan, Qingman Zhang","doi":"10.1145/3431379.3460639","DOIUrl":"https://doi.org/10.1145/3431379.3460639","url":null,"abstract":"As the scale and complexity of deep learning models continues to grow, model training is becoming an expensive job and only a small number of well-financed organizations can afford. Are the resources in commodity clusters well utilized for training? or how much potential space are still there for further improving the training efficiency in commodity clusters? is an urgent question to answer. In this paper, we review the processing of distributed learning training (DDL) in commodity GPU clusters and find that the current resource utilization is not only low but also imbalanced. We observe two features that can be exploited for further improving the training efficiency: partial predictable training and unified CPU-GPU training. Based on the observations, we present AITurbo, a novel resource scheduler that treats predictable and unpredictable jobs separately, but allocates heterogeneous CPU-GPU resource in a unified way. For predictable jobs, AITurbo designs a predicting model to estimate their performance under various heterogeneous resource allocations. For unpredictable jobs, it schedules them following the least-attained-service-first manner. AITurbo further designs a Borda-count based multi-level feedback queue method to combine them together. AITurbo demonstrates that there is still significant space for improving the training efficiency in commodity clusters. We evaluate AITurbo using jobs from Tensorflow benchmarks, which are submitted following the real trace of three production systems. Experimental results show that, compared with the state-of-the-art, AITurbo can reduce the average job completion time of DDL jobs by 3x.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130754854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
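The abstract names a Borda-count-based method for combining job orderings; below is a minimal sketch of just that combination step (the surrounding multi-level feedback queue and performance model are omitted, and the input rankings are illustrative).

```python
def borda_combine(*rankings):
    """Merge several job orderings into one via Borda count.

    Each ranking lists job ids from most to least preferred; a job earns
    (n - position) points per ranking, and the merged order sorts by points.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, job in enumerate(ranking):
            scores[job] = scores.get(job, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

# One ordering from a performance model, one from least-attained-service.
by_model = ["jobA", "jobC", "jobB"]
by_service = ["jobC", "jobB", "jobA"]
print(borda_combine(by_model, by_service))  # -> ['jobC', 'jobA', 'jobB']
```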
SnuRHAC
Jaehoon Jung, Daeyoung Park, Gangwon Jo, Jungho Park, Jaejin Lee
{"title":"SnuRHAC","authors":"Jaehoon Jung, Daeyoung Park, Gangwon Jo, Jungho Park, Jaejin Lee","doi":"10.1145/3431379.3460647","DOIUrl":"https://doi.org/10.1145/3431379.3460647","url":null,"abstract":"This paper proposes a framework called SnuRHAC, which provides an illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed to use a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes workload to multiple GPUs in a cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends CUDA Unified Memory and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and to maximize performance. Static prefetching allows SnuRHAC to prefetch data by statically analyzing CUDA kernels. Dynamic prefetching complements static prefetching. SnuRHAC enforces an application to run on a single GPU if it is not suitable for multiple GPUs. We evaluate the performance of SnuRHAC using 18 benchmark applications from various sources. The evaluation result shows that while SnuRHAC significantly improves ease-of-programming, it shows scalable performance for the cluster environment depending on the application characteristics.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126371246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
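This is not SnuRHAC's actual mechanism, which operates at the CUDA runtime level, but a sketch of the core distribution idea: carve one logical kernel grid into contiguous thread-block ranges, one per GPU, so each device executes the unmodified kernel over block indices consistent with a single large GPU.

```python
def partition_grid(num_blocks, num_gpus):
    """Assign contiguous thread-block ranges of one logical grid to each GPU.

    Each GPU would run the unmodified kernel over its [start, end) block range,
    preserving the block numbering of a single-GPU launch.
    """
    base, extra = divmod(num_blocks, num_gpus)
    ranges, start = [], 0
    for g in range(num_gpus):
        count = base + (1 if g < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + count))
        start += count
    return ranges

# A 1000-block grid spread over 3 GPUs, possibly on different nodes.
print(partition_grid(1000, 3))  # -> [(0, 334), (334, 667), (667, 1000)]
```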