Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing: Latest Publications

Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters
Piyush Sao, Hao Lu, R. Kannan, Vijay Thakkar, R. Vuduc, T. Potok
{"title":"Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters","authors":"Piyush Sao, Hao Lu, R. Kannan, Vijay Thakkar, R. Vuduc, T. Potok","doi":"10.1145/3431379.3460651","DOIUrl":"https://doi.org/10.1145/3431379.3460651","url":null,"abstract":"We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matrix-multiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPI-collective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU on-demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256~nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128605199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
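The blocked structure that makes Floyd-Warshall resemble matrix multiplication is the basis of the paper's GPU formulation. As a reference point, here is a minimal single-node sketch of the classic three-phase blocked Floyd-Warshall in NumPy; it shows the min-plus update the paper parallelizes, not the paper's distributed, pipelined, GPU-offload implementation.

```python
import numpy as np

def fw_block_update(C, A, B):
    """Min-plus 'matrix multiply' update: C[i,j] = min(C[i,j], min_k A[i,k] + B[k,j])."""
    for k in range(A.shape[1]):
        np.minimum(C, A[:, k, None] + B[None, k, :], out=C)

def blocked_floyd_warshall(D, b=64):
    """In-place blocked APSP on a dense n x n distance matrix D (np.inf = no edge)."""
    n = D.shape[0]
    nb = (n + b - 1) // b

    def blk(i, j):  # view of block (i, j)
        return D[i*b:min((i+1)*b, n), j*b:min((j+1)*b, n)]

    for k in range(nb):
        # Phase 1: the diagonal block depends only on itself.
        fw_block_update(blk(k, k), blk(k, k), blk(k, k))
        # Phase 2: row and column panels depend on the diagonal block.
        for j in range(nb):
            if j != k:
                fw_block_update(blk(k, j), blk(k, k), blk(k, j))
                fw_block_update(blk(j, k), blk(j, k), blk(k, k))
        # Phase 3: remaining blocks update like a (min-plus) matrix multiply.
        for i in range(nb):
            for j in range(nb):
                if i != k and j != k:
                    fw_block_update(blk(i, j), blk(i, k), blk(k, j))
    return D
```

Phase 3 dominates the work and has the same data-access pattern as GEMM, which is why the approach maps well to GPUs and to communication-avoiding distribution schemes.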
Machine Learning Augmented Hybrid Memory Management
Thaleia Dimitra Doudali, Ada Gavrilovska
{"title":"Machine Learning Augmented Hybrid Memory Management","authors":"Thaleia Dimitra Doudali, Ada Gavrilovska","doi":"10.1145/3431379.3464450","DOIUrl":"https://doi.org/10.1145/3431379.3464450","url":null,"abstract":"The integration of emerging non volatile memory hardware technologies into the main memory substrate, enables massive memory capacities at a reasonable cost in return for slower access speeds. This heterogeneity, along with the greater irregularity in the behavior of emerging workloads, render existing memory management approaches ineffective. This creates a significant gap between the realized vs. achievable performance and efficiency. At the same time, resource management solutions augmented with machine learning show great promise for fine-tuning system configuration knobs and predicting future behaviors. This thesis builds novel system-level mechanisms and reveals new insights towards the practical integration of machine learning in hybrid memory management. The specific contributions of this thesis is a machine learning augmented memory manager, coupled with insightful mechanisms to reduce the associated learning overheads and fine-tune critical operational parameters. The impact of this thesis is realizing an average of 3x application performance improvements and setting the new state-of-the-art in hybrid memory management.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125951057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
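The abstract does not spell out the thesis's predictors or placement mechanisms, so the following is a purely hypothetical sketch of the general shape of learning-guided hybrid memory management: predict per-page access counts from recent history (here with a simple EWMA, an assumption, not the thesis's model) and keep the hottest predicted pages in the fast tier.

```python
def ewma_predict(history, alpha=0.5):
    """Predict next-interval access count for a page from its past counts (assumed EWMA)."""
    pred = 0.0
    for count in history:
        pred = alpha * count + (1 - alpha) * pred
    return pred

def plan_placement(page_histories, fast_capacity_pages):
    """Return the set of page ids to keep in the fast (DRAM) tier next interval.

    page_histories: dict page_id -> list of per-interval access counts.
    Pages not selected stay in the slower (NVM) tier.
    """
    ranked = sorted(page_histories,
                    key=lambda p: ewma_predict(page_histories[p]),
                    reverse=True)
    return set(ranked[:fast_capacity_pages])

# Hottest predicted pages go to DRAM; the cold page stays in NVM.
print(plan_placement({1: [9, 12, 10], 2: [0, 1, 0], 3: [5, 4, 6]},
                     fast_capacity_pages=2))
```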
CharminG
Jaemin Choi, D. Richards, L. Kalé
{"title":"CharminG","authors":"Jaemin Choi, D. Richards, L. Kalé","doi":"10.1145/3431379.3464454","DOIUrl":"https://doi.org/10.1145/3431379.3464454","url":null,"abstract":"Host-driven execution of applications on modern GPU-accelerated systems suffer from frequent host-device synchronizations, data movement and limited flexibility in scheduling user tasks. We present CharminG, a runtime system designed to run entirely on the GPU without any interaction with the host. CharminG takes inspiration from the Charm++ parallel programming system and implements processor virtualization and message-driven execution on the GPU. We evaluate the composability and preliminary performance of CharminG with a proxy application that performs the Jacobi iterative method in a two-dimensional grid, using the Lassen supercomputer at Lawrence Livermore National Laboratory.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124463717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
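CharminG's scheduler runs on the GPU itself, which is not reproduced here; the host-side Python sketch below only illustrates the Charm++-style execution model it adopts: virtualized processors (chares) whose work is driven entirely by messages pulled from a queue. All class and method names are illustrative, not CharminG's API.

```python
from collections import deque

class Chare:
    """A virtual processor: private state plus message handlers, as in Charm++."""
    def __init__(self, name):
        self.name = name
        self.total = 0

    def on_add(self, sched, value):
        self.total += value
        if self.total < 10:
            # Further work happens only by sending more messages.
            sched.send(self.name, "on_add", self.total)

class Scheduler:
    """Message-driven execution: pop a message, run its handler, repeat."""
    def __init__(self):
        self.chares, self.queue = {}, deque()

    def spawn(self, name):
        self.chares[name] = Chare(name)

    def send(self, target, method, *args):
        self.queue.append((target, method, args))

    def run(self):
        while self.queue:
            target, method, args = self.queue.popleft()
            getattr(self.chares[target], method)(self, *args)

sched = Scheduler()
sched.spawn("a")
sched.send("a", "on_add", 1)
sched.run()  # messages drive all execution; no host-driven synchronization
```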
DLion
Rankyung Hong, A. Chandra
{"title":"DLion","authors":"Rankyung Hong, A. Chandra","doi":"10.1145/3431379.3460643","DOIUrl":"https://doi.org/10.1145/3431379.3460643","url":null,"abstract":"Deep learning (DL) is a popular technique for building models from large quantities of data such as pictures, videos, messages generated from edges devices at rapid pace all over the world. It is often infeasible to migrate large quantities of data from the edges to centralized data center(s) over WANs for training due to privacy, cost, and performance reasons. At the same time, training large DL models on edge devices is infeasible due to their limited resources. An attractive alternative for DL training distributed data is to use micro-clouds---small-scale clouds deployed near edge devices in multiple locations. However, micro-clouds present the challenges of both computation and network resource heterogeneity as well as dynamism. In this paper, we introduce DLion, a new and generic decentralized distributed DL system designed to address the key challenges in micro-cloud environments, in order to reduce overall training time and improve model accuracy. We present three key techniques in DLion: (1) Weighted dynamic batching to maximize data parallelism for dealing with heterogeneous and dynamic compute capacity, (2) Per-link prioritized gradient exchange to reduce communication overhead for model updates based on available network capacity, and (3) Direct knowledge transfer to improve model accuracy by merging the best performing model parameters. We build a prototype of DLion on top of TensorFlow and show that DLion achieves up to 4.2X speedup in an Amazon GPU cluster, and up to 2X speed up and 26% higher model accuracy in a CPU cluster over four state-of-the-art distributed DL systems.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115900974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
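A minimal sketch of the idea behind weighted dynamic batching: size each worker's batch in proportion to its recently measured throughput, so heterogeneous workers finish an iteration at roughly the same time. The function name and interface are illustrative, not DLion's API.

```python
def weighted_batch_sizes(global_batch, throughputs):
    """Split a global batch across workers in proportion to samples/sec.

    throughputs: list of recent per-worker training rates (samples/sec),
    re-measured each interval so the split adapts to dynamic capacity.
    """
    total = sum(throughputs)
    sizes = [max(1, round(global_batch * t / total)) for t in throughputs]
    # Fix rounding drift so the sizes sum exactly to the global batch.
    sizes[sizes.index(max(sizes))] += global_batch - sum(sizes)
    return sizes

# A fast GPU worker gets a larger slice than two slower CPU workers.
print(weighted_batch_sizes(256, [900.0, 150.0, 150.0]))  # -> [192, 32, 32]
```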
Jigsaw
Staci A. Smith, D. Lowenthal
{"title":"Jigsaw","authors":"Staci A. Smith, D. Lowenthal","doi":"10.4135/9781452232324.n9","DOIUrl":"https://doi.org/10.4135/9781452232324.n9","url":null,"abstract":"","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114601335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LaSS: Running Latency Sensitive Serverless Computations at the Edge
Bin Wang, A. Ali-Eldin, P. Shenoy
{"title":"LaSS: Running Latency Sensitive Serverless Computations at the Edge","authors":"Bin Wang, A. Ali-Eldin, P. Shenoy","doi":"10.1145/3431379.3460646","DOIUrl":"https://doi.org/10.1145/3431379.3460646","url":null,"abstract":"Serverless computing has emerged as a new paradigm for running short-lived computations in the cloud. Due to its ability to handle IoT workloads, there has been considerable interest in running serverless functions at the edge. However, the constrained nature of the edge and the latency sensitive nature of workloads result in many challenges for serverless platforms. In this paper, we present LaSS, a platform that uses model-driven approaches for running latency-sensitive serverless computations on edge resources. LaSS uses principled queuing-based methods to determine an appropriate allocation for each hosted function and auto-scales the allocated resources in response to workload dynamics. LaSS uses a fair-share allocation approach to guarantee a minimum of allocated resources to each function in the presence of overload. In addition, it utilizes resource reclamation methods based on container deflation and termination to reassign resources from over-provisioned functions to under-provisioned ones. We implement a prototype of our approach on an OpenWhisk serverless edge cluster and conduct a detailed experimental evaluation. Our results show that LaSS can accurately predict the resources needed for serverless functions in the presence of highly dynamic workloads, and reprovision container capacity within hundreds of milliseconds while maintaining fair share allocation guarantees.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127615390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
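The abstract does not name the queuing model, so the sketch below assumes an M/M/c model per function: given a measured request arrival rate and per-container service rate, it finds the smallest container count whose mean response time (Erlang-C waiting time plus service time) meets the latency target. This is one plausible instance of a "principled queuing-based method," not LaSS's actual formula.

```python
import math

def erlang_c(c, a):
    """Probability an arrival must wait in an M/M/c queue with offered load a = lam/mu."""
    top = (a ** c / math.factorial(c)) * (c / (c - a))
    bottom = sum(a ** k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def containers_needed(lam, mu, latency_slo):
    """Smallest container count c with mean response time 1/mu + Wq <= latency_slo.

    lam: request arrival rate (req/s); mu: per-container service rate (req/s).
    """
    c = max(1, math.ceil(lam / mu))
    while True:
        if c * mu > lam:  # queue must be stable before Erlang-C applies
            wq = erlang_c(c, lam / mu) / (c * mu - lam)  # mean queueing delay
            if 1.0 / mu + wq <= latency_slo:
                return c
        c += 1

# 50 req/s, 100 ms average service time, 150 ms response-time target -> 7 containers.
print(containers_needed(lam=50.0, mu=10.0, latency_slo=0.150))
```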
Parallel Program Scaling Analysis using Hardware Counters
Shobhit Jagga, Preeti Malakar
{"title":"Parallel Program Scaling Analysis using Hardware Counters","authors":"Shobhit Jagga, Preeti Malakar","doi":"10.1145/3431379.3464453","DOIUrl":"https://doi.org/10.1145/3431379.3464453","url":null,"abstract":"We present a lightweight library that automatically collects several hardware counters for MPI applications. We analyze the effect of strong and weak scaling on the counters. We first correlate the counter values obtained from each process count, and then cluster the counters to identify counters that are affected similarly due to scaling. We noted that the effect of last-level cache misses is more pronounced for some applications such as miniFE.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122212077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
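A sketch of the analysis step described above: treat each counter as a vector of values across process counts, correlate the vectors, and greedily group counters that scale alike. The 0.95 threshold, the greedy grouping, and the sample counter values are assumptions for illustration, not the paper's method or data.

```python
import numpy as np

def cluster_counters(counters, threshold=0.95):
    """Greedily group counters whose values move together as process count grows.

    counters: dict name -> array of values, one entry per process count
    (e.g. measured at 2, 4, ..., 64 MPI ranks).
    """
    names = list(counters)
    data = np.array([counters[n] for n in names], dtype=float)
    corr = np.corrcoef(data)  # pairwise correlation across scaling points
    clusters = []
    for i, name in enumerate(names):
        for cl in clusters:
            # Join a cluster only if correlated with every member.
            if all(corr[i, names.index(m)] >= threshold for m in cl):
                cl.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Cache misses and cycles shrink together under strong scaling; FLOP count stays flat.
print(cluster_counters({
    "PAPI_L3_TCM":  np.array([8.0, 5.0, 3.5, 3.0]),
    "PAPI_TOT_CYC": np.array([80.0, 47.0, 33.0, 28.0]),
    "PAPI_FP_OPS":  np.array([10.0, 10.1, 9.9, 10.0]),
}))
```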
Using Pilot Jobs and CernVM File System for Simplified Use of Containers and Software Distribution
N. Urs, M. Mambelli, D. Dykstra
{"title":"Using Pilot Jobs and CernVM File System for Simplified Use of Containers and Software Distribution","authors":"N. Urs, M. Mambelli, D. Dykstra","doi":"10.1145/3431379.3464451","DOIUrl":"https://doi.org/10.1145/3431379.3464451","url":null,"abstract":"High Energy Physics (HEP) experiments entail an abundance of computing resources, i.e. sites, to run simulations and analyses by processing data. This requirement is fulfilled by local batch farms, grid sites, private/commercial clouds, and supercomputing centers via High Throughput Computing (HTC). The growing needs of such experiments and resources being prone to trends of heterogeneity make it difficult for physicists to handle these resources directly. Additionally, HEP collaborations heavily rely on data and software releases, typically in the order of tens of gigabytes, while conducting simulations and analyses. Hence, aspects of scalability, reliability, and maintenance become crucial with regards to the distribution of the necessary data and software stack. The GlideinWMS [4] framework helps with the resource management problem by using pilot jobs, aka Glideins, to provision reliable elastic virtual clusters. Glideins are submitted to unreliable heterogeneous resources which are validated and customized by the Glideins to make the worker nodes available for end-user job execution. On the other hand, the CernVM File System (CernVM-FS or CVMFS) [1] helps with data distribution. It is a write-once, read-everywhere filesystem used to deploy scientific software to thousands of nodes on a worldwide distributed computing infrastructure. CVMFS is based on the Hyper Text Transfer Protocol and has been widely used within the particle physics community for (1) distributing experiment software and data such as calibrations, and (2) facilitating containerization by efficiently hosting container images along with providing containerization software, especially Singularity [3] GlideinWMS relies on CVMFS installed locally on the computing resources to satisfy the experiments' software needs. This requires system administrators' effort to install and maintain CVMFS at the sites and limits the use of sites, especially HPC resources, that do not have CVMFS installed. This poster presents a solution, taking advantage of Glideins to provide CVMFS at most sites without the need for a local installation. Doing so expands the pool of resources available for HEP experiments and reduces the effort of system administrators for current resources. Additionally, the proposed solution allows GlideinWMS to also start Singularity [3], a containerization software that can run unprivileged, on sites where neither CVMFS nor Singularity are available, including HPC sites. 
The benefits provided by this solution are: (1) lower overhead for site administrators in that they have less software to install, (2) an expanded pool of resources that run user jobs with easy access to software and data provided by CVMFS, thus making life easier for the scientists, and (3) improved flexibility to use HPC resources by enabling GlideinWMS pilot jobs to support HPC sites.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127471858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AITurbo
Laiping Zhao, Fangshu Li, W. Qu, Kunlin Zhan, Qingman Zhang
{"title":"AITurbo","authors":"Laiping Zhao, Fangshu Li, W. Qu, Kunlin Zhan, Qingman Zhang","doi":"10.1145/3431379.3460639","DOIUrl":"https://doi.org/10.1145/3431379.3460639","url":null,"abstract":"As the scale and complexity of deep learning models continues to grow, model training is becoming an expensive job and only a small number of well-financed organizations can afford. Are the resources in commodity clusters well utilized for training? or how much potential space are still there for further improving the training efficiency in commodity clusters? is an urgent question to answer. In this paper, we review the processing of distributed learning training (DDL) in commodity GPU clusters and find that the current resource utilization is not only low but also imbalanced. We observe two features that can be exploited for further improving the training efficiency: partial predictable training and unified CPU-GPU training. Based on the observations, we present AITurbo, a novel resource scheduler that treats predictable and unpredictable jobs separately, but allocates heterogeneous CPU-GPU resource in a unified way. For predictable jobs, AITurbo designs a predicting model to estimate their performance under various heterogeneous resource allocations. For unpredictable jobs, it schedules them following the least-attained-service-first manner. AITurbo further designs a Borda-count based multi-level feedback queue method to combine them together. AITurbo demonstrates that there is still significant space for improving the training efficiency in commodity clusters. We evaluate AITurbo using jobs from Tensorflow benchmarks, which are submitted following the real trace of three production systems. Experimental results show that, compared with the state-of-the-art, AITurbo can reduce the average job completion time of DDL jobs by 3x.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130754854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
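The abstract names a Borda-count-based method for combining job orderings; below is a minimal sketch of just that combination step (the surrounding multi-level feedback queue and performance model are omitted, and the input rankings are illustrative).

```python
def borda_combine(*rankings):
    """Merge several job orderings into one via Borda count.

    Each ranking lists job ids from most to least preferred; a job earns
    (n - position) points per ranking, and the merged order sorts by points.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, job in enumerate(ranking):
            scores[job] = scores.get(job, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

# One ordering from a performance model, one from least-attained-service.
by_model = ["jobA", "jobC", "jobB"]
by_service = ["jobC", "jobB", "jobA"]
print(borda_combine(by_model, by_service))  # -> ['jobC', 'jobA', 'jobB']
```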
SnuRHAC
Jaehoon Jung, Daeyoung Park, Gangwon Jo, Jungho Park, Jaejin Lee
{"title":"SnuRHAC","authors":"Jaehoon Jung, Daeyoung Park, Gangwon Jo, Jungho Park, Jaejin Lee","doi":"10.1145/3431379.3460647","DOIUrl":"https://doi.org/10.1145/3431379.3460647","url":null,"abstract":"This paper proposes a framework called SnuRHAC, which provides an illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed to use a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes workload to multiple GPUs in a cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends CUDA Unified Memory and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and to maximize performance. Static prefetching allows SnuRHAC to prefetch data by statically analyzing CUDA kernels. Dynamic prefetching complements static prefetching. SnuRHAC enforces an application to run on a single GPU if it is not suitable for multiple GPUs. We evaluate the performance of SnuRHAC using 18 benchmark applications from various sources. The evaluation result shows that while SnuRHAC significantly improves ease-of-programming, it shows scalable performance for the cluster environment depending on the application characteristics.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126371246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
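This is not SnuRHAC's actual mechanism, which operates at the CUDA runtime level, but a sketch of the core distribution idea: carve one logical kernel grid into contiguous thread-block ranges, one per GPU, so each device executes the unmodified kernel over block indices consistent with a single large GPU.

```python
def partition_grid(num_blocks, num_gpus):
    """Assign contiguous thread-block ranges of one logical grid to each GPU.

    Each GPU would run the unmodified kernel over its [start, end) block range,
    preserving the block numbering of a single-GPU launch.
    """
    base, extra = divmod(num_blocks, num_gpus)
    ranges, start = [], 0
    for g in range(num_gpus):
        count = base + (1 if g < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + count))
        start += count
    return ranges

# A 1000-block grid spread over 3 GPUs, possibly on different nodes.
print(partition_grid(1000, 3))  # -> [(0, 334), (334, 667), (667, 1000)]
```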