2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

An Experimental Study of Two-level Schwarz Domain-Decomposition Preconditioners on GPUs
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-04-10 DOI: 10.1109/IPDPS54959.2023.00073
I. Yamazaki, Alexander Heinlein, S. Rajamanickam
Abstract: The generalized Dryja–Smith–Widlund (GDSW) preconditioner is a two-level overlapping Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlapping Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial differential equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package which implements GDSW-type preconditioners for both CPU and GPU clusters. To improve solver performance on GPUs, we use a novel decomposition to run multiple MPI processes on each GPU, reducing both the solver's computational and storage costs and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy. The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options, including the exact or inexact solution of the local overlapping subdomain problems on a GPU. We also discuss the effect of using the iterative variant of incomplete LU factorization and sparse-triangular solve as the approximate local solver, and of using lower precision for computing the whole FROSch preconditioner. Overall, the solve time was reduced by factors of about 2× using GPUs, while the GPU acceleration of the numerical setup time depends on the solver options and the local matrix sizes.
Citations: 1
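As a rough illustration of the two-level additive Schwarz action described in the abstract, the sketch below applies M⁻¹r = Φ A₀⁻¹ Φᵀ r + Σᵢ Rᵢᵀ Aᵢ⁻¹ Rᵢ r with exact local solves. The overlapping index sets `subdomains` and the coarse basis `Phi` are assumed inputs (in GDSW, Φ spans an energy-minimizing coarse space); this is a NumPy sketch of the preconditioner's structure, not FROSch's API.

```python
# Minimal two-level additive Schwarz apply (illustrative, not FROSch's API).
# Assumptions: A is a scipy.sparse CSR matrix, `subdomains` is a list of
# overlapping index arrays, and `Phi` is a dense (n x n_coarse) coarse basis.
import numpy as np

def two_level_schwarz_apply(A, r, subdomains, Phi):
    z = np.zeros_like(r)
    # First level: exact solves on the local overlapping subdomain problems.
    for idx in subdomains:
        A_i = A[np.ix_(idx, idx)].toarray()
        z[idx] += np.linalg.solve(A_i, r[idx])
    # Second level: Galerkin coarse problem A0 = Phi^T A Phi.
    A0 = Phi.T @ (A @ Phi)
    z += Phi @ np.linalg.solve(A0, Phi.T @ r)
    return z
```

Wrapped in a `scipy.sparse.linalg.LinearOperator`, such an apply can be passed as the preconditioner `M` to SciPy's CG or GMRES to accelerate the Krylov iteration, mirroring how FROSch is used inside a Krylov solver.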
Dynasparse: Accelerating GNN Inference through Dynamic Sparsity Exploitation
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-03-22 DOI: 10.1109/IPDPS54959.2023.00032
Bingyi Zhang, V. Prasanna
Abstract: Graph Neural Network (GNN) inference is used in many real-world applications. Data sparsity in GNN inference, including sparsity in both the input graph and the GNN model, offers opportunities to further speed up inference. Moreover, many pruning techniques have been proposed for model compression that increase the data sparsity of GNNs. We propose Dynasparse, a comprehensive hardware-software codesign on FPGA to accelerate GNN inference through dynamic sparsity exploitation. To this end, we decouple the GNN computation kernels from the basic computation primitives and explore hardware-software codesign as follows: 1) Hardware design: we propose a novel unified accelerator design on FPGA to efficiently execute various computation primitives. We develop a customized soft processor that is tightly coupled with the accelerator to execute a runtime system. Moreover, we develop efficient hardware mechanisms to profile the data sparsity and perform on-the-fly data format transformation to prepare the input data for the various computation primitives. 2) Software design: we develop a runtime system that works synergistically with the accelerator to perform dynamic kernel-to-primitive mapping based on data sparsity. We implement Dynasparse on a state-of-the-art FPGA platform, Xilinx Alveo U250, and evaluate the design using widely used GNN models (GCN, GraphSAGE, GIN, and SGC). For these GNN models and various input graphs, the proposed accelerator and dynamic kernel-to-primitive mapping reduce the inference latency by 3.73× on average compared with the static mapping strategies employed in state-of-the-art GNN accelerators. Compared with state-of-the-art CPU (GPU) implementations, Dynasparse achieves up to 56.9× (2.37×) speedup in end-to-end latency. Compared with state-of-the-art FPGA implementations, Dynasparse achieves 2.7× speedup in accelerator execution latency.
Citations: 1
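The dynamic kernel-to-primitive mapping can be pictured as a run-time dispatch on measured sparsity. The toy sketch below shows the idea for a single aggregation in plain NumPy/SciPy; the 10% density threshold is an assumption for illustration, whereas Dynasparse profiles sparsity in hardware and tunes the mapping on its soft processor.

```python
# Toy run-time dispatch: profile operand density, then choose a sparse
# (SpMM) or dense (GEMM) primitive for the aggregation A @ X.
import numpy as np
import scipy.sparse as sp

def density(M):
    nnz = M.nnz if sp.issparse(M) else np.count_nonzero(M)
    return nnz / (M.shape[0] * M.shape[1])

def aggregate(A, X, sparse_threshold=0.1):   # threshold is an assumption
    if density(A) < sparse_threshold:
        A = A if sp.issparse(A) else sp.csr_matrix(A)   # sparse primitive
    else:
        A = A.toarray() if sp.issparse(A) else A        # dense primitive
    return A @ X
```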
RT-DBSCAN: Accelerating DBSCAN using Ray Tracing Hardware
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-03-16 DOI: 10.1109/IPDPS54959.2023.00100
Vani Nagarajan, Milind Kulkarni
Abstract: General-purpose computing on graphics processing units (GPGPU) has resulted in unprecedented levels of speedup over its CPU counterparts, allowing programmers to harness the computational power of GPU shader cores to accelerate other computing applications. But this style of acceleration is best suited to regular computations (e.g., linear algebra). Recent GPUs feature new ray tracing (RT) cores that instead speed up the irregular process of ray tracing using bounding volume hierarchies. While these cores seem limited in functionality, they can be used to accelerate n-body problems by leveraging RT cores to accelerate the required distance computations. In this work, we propose RT-DBSCAN, the first RT-accelerated DBSCAN implementation. We use RT cores to accelerate Density-Based Spatial Clustering of Applications with Noise (DBSCAN) by translating fixed-radius nearest-neighbor queries into ray tracing queries. We show that leveraging the RT hardware results in speedups between 1.3× and 4× over current state-of-the-art, GPU-based DBSCAN implementations.
Citations: 1
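The fixed-radius neighbor query that dominates DBSCAN is sketched below on the CPU, with a k-d tree standing in for the RT hardware. RT-DBSCAN's translation wraps each point in a bounding sphere of radius eps so that BVH traversal on the RT cores reports exactly these neighbor lists; the sketch only shows the query the hardware replaces.

```python
# CPU stand-in for the fixed-radius neighbor query at the heart of DBSCAN.
# In RT-DBSCAN, a BVH over eps-radius spheres answers this on RT cores.
import numpy as np
from scipy.spatial import cKDTree

def fixed_radius_neighbors(points, eps):
    tree = cKDTree(points)
    # For each point, the indices of all points within distance eps.
    return tree.query_ball_tree(tree, r=eps)

def core_points(points, eps, min_pts):
    neighbors = fixed_radius_neighbors(points, eps)
    return [i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts]
```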
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-03-15 DOI: 10.1109/IPDPS54959.2023.00103
Quentin G. Anthony, A. Awan, Jeff Rasley, Yuxiong He, A. Shafi, M. Abduljabbar, H. Subramoni, D. Panda
Abstract: In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, necessitating distribution among processors. Training such massive models requires advanced parallelism strategies [1], [2] to maintain efficiency; examples of models using such strategies include Deep Learning Recommendation Models (DLRM) [3] and Mixture-of-Experts (MoE) [4], [5]. However, these distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales, and communication libraries' performance varies wildly across different communication operations, scales, and message sizes. We propose MCR-DL: an extensible DL communication framework that supports all point-to-point and collective operations while enabling users to dynamically mix-and-match communication backends for a given operation without deadlocks. MCR-DL also comes packaged with a tuning suite for dynamically selecting the best communication backend for a given input tensor. We select DeepSpeed-MoE and DLRM as candidate DL models and demonstrate a 31% improvement in DS-MoE throughput on 256 V100 GPUs on the Lassen HPC system. Further, we achieve a 20% throughput improvement in a dense Megatron-DeepSpeed model and a 25% throughput improvement in DLRM on 32 A100 GPUs on the Theta-GPU HPC system.
Citations: 1
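A minimal sketch of the mix-and-match idea: select the backend for an operation from a table tuned per message size. The `backends` modules, the tuning table, and the torch-like tensor interface (`numel()`, `element_size()`) are illustrative assumptions; MCR-DL's tuning suite produces such a mapping, but its actual interface is not shown here.

```python
# Dispatch an allreduce to the backend that a (hypothetical) tuning pass
# found fastest for this message size. Backends are assumed to expose a
# compatible allreduce(tensor) function.
import bisect

class CommDispatcher:
    def __init__(self, backends, tuned):
        self.backends = backends   # e.g. {"nccl": ..., "mpi": ...} (assumed)
        self.tuned = tuned         # op -> sorted [(max_bytes, backend_name)]

    def allreduce(self, tensor):
        nbytes = tensor.numel() * tensor.element_size()
        table = self.tuned["allreduce"]
        cutoffs = [max_bytes for max_bytes, _ in table]
        i = min(bisect.bisect_left(cutoffs, nbytes), len(table) - 1)
        return self.backends[table[i][1]].allreduce(tensor)
```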
GPU-enabled Function-as-a-Service for Machine Learning Inference
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-03-09 DOI: 10.1109/IPDPS54959.2023.00096
Ming Zhao, Kritshekhar Jha, Sungho Hong
Abstract: Function-as-a-Service (FaaS) is emerging as an important cloud computing service model, as it can improve the scalability and usability of a wide range of applications, especially Machine Learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks rely heavily on GPUs to achieve high performance; however, support for GPUs is currently lacking in existing FaaS solutions. The unique event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must consider the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that enables ML inference functions to efficiently utilize GPUs to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it provides caching of ML models in GPU memory to improve the performance of model inference functions, and global management of GPU memories to improve cache utilization. Third, it offers co-designed GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and the proposed locality-aware scheduler achieves a 48× speedup compared to the default, load-balancing-only schedulers.
Citations: 1
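Locality-aware scheduling can be sketched as a two-step choice: prefer workers that already hold the requested model in GPU memory (a cache hit avoids re-transferring model parameters to the device), and fall back to the least-loaded worker otherwise. The worker representation below is a hypothetical simplification, not the paper's OpenFaaS extension.

```python
# Toy locality-aware scheduler: cache locality first, then load balancing.
from dataclasses import dataclass, field

@dataclass
class GpuWorker:
    name: str
    load: int = 0
    cached_models: set = field(default_factory=set)

def schedule(workers, model_id):
    hits = [w for w in workers if model_id in w.cached_models]
    pool = hits if hits else workers       # prefer cache hits when they exist
    w = min(pool, key=lambda w: w.load)    # otherwise balance load
    w.load += 1
    w.cached_models.add(model_id)          # the chosen worker warms its cache
    return w
```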
HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-03-01 DOI: 10.1109/IPDPS54959.2023.00062
Yi-Chien Lin, V. Prasanna
Abstract: Graph Neural Networks (GNNs) have shown success in many real-world applications that involve graph-structured data. Most existing single-node GNN training systems are capable of training medium-scale graphs with tens of millions of edges; however, scaling them to large-scale graphs with billions of edges remains challenging. In addition, it is challenging to map GNN training algorithms onto a computation node, as state-of-the-art machines feature heterogeneous architectures consisting of multiple processors and a variety of accelerators. We propose HyScale-GNN, a novel system to train GNN models on a single-node heterogeneous architecture. HyScale-GNN performs hybrid training, which utilizes both the processors and the accelerators to train a model collaboratively. Our system design overcomes the memory size limitation of existing works and is optimized for training GNNs on large-scale graphs. We propose a two-stage data pre-fetching scheme to reduce the communication overhead during GNN training. To improve task-mapping efficiency, we propose a dynamic resource management mechanism, which adjusts the workload assignment and resource allocation at runtime. We evaluate HyScale-GNN on a CPU-GPU and a CPU-FPGA heterogeneous architecture. Using several large-scale datasets and two widely used GNN models, we compare the performance of our design with a multi-GPU baseline implemented in PyTorch Geometric. The CPU-GPU design and the CPU-FPGA design achieve up to 2.08× speedup and 12.6× speedup, respectively. Compared with state-of-the-art large-scale multi-node GNN training systems such as P3 and DistDGL, our CPU-FPGA design achieves up to 5.27× speedup using a single node.
Citations: 1
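At its core, data pre-fetching of this kind overlaps gathering the next mini-batch with training on the current one, hiding host-to-accelerator communication latency. A minimal sketch, assuming an iterator `loader` that yields batches and a placeholder `train_step` (the paper's two-stage scheme and resource manager are more involved and not shown):

```python
# Overlap batch preparation with computation using a background thread.
from concurrent.futures import ThreadPoolExecutor

def train_step(model, batch):
    """Placeholder for the forward/backward pass on the accelerator."""
    pass

def train(loader, model, steps):
    loader = iter(loader)                       # assumed to yield >= steps+1 batches
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(next, loader)      # prefetch the first batch
        for _ in range(steps):
            batch = future.result()             # wait for the prefetched batch
            future = pool.submit(next, loader)  # prefetch batch i+1 ...
            train_step(model, batch)            # ... while computing on batch i
```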
Efficient Hardware Primitives for Immediate Memory Reclamation in Optimistic Data Structures
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-02-25 DOI: 10.1109/IPDPS54959.2023.00021
Ajay Singh, Trevor Brown, Michael F. Spear
Abstract: Safe memory reclamation (SMR) algorithms are crucial for preventing use-after-free errors in optimistic data structures. SMR algorithms typically delay reclamation for safety and reclaim objects in batches for efficiency, and it is difficult to strike a balance between performance and space efficiency: small batch sizes and frequent reclamation attempts lead to high overhead, while freeing large batches can lead to long program interruptions and high memory footprints. An ideal SMR algorithm would forgo batching and reclaim memory immediately, without suffering high reclamation overheads. To this end, we propose Conditional Access: a set of hardware instructions that offer immediate reclamation and low overhead in optimistic data structures. Conditional Access harnesses cache coherence to enable threads to efficiently detect potential use-after-free errors without explicit shared-memory communication and without introducing additional coherence traffic. We implement and evaluate Conditional Access in Graphite, a multicore simulator. Our experiments show that Conditional Access can rival the performance of highly optimized and carefully tuned SMR algorithms while simultaneously allowing immediate reclamation. This results in concurrent data structures with memory footprints similar to those of their sequential counterparts.
Citations: 0
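Conditional Access is a hardware primitive, so it has no direct software equivalent; the seqlock-style sketch below only illustrates the hazard it eliminates: an optimistic reader must somehow validate that a node was not reclaimed and reused mid-read. The hardware performs this detection through cache coherence with no explicit validation protocol; here a per-node version counter stands in for that check.

```python
# Software stand-in (a seqlock-style validated read), NOT Conditional Access
# itself: the version counter plays the role the hardware plays via coherence.
import threading

class Node:
    def __init__(self, value):
        self.value = value
        self.version = 0              # even = stable; odd = being reused
        self._lock = threading.Lock()

def optimistic_read(node):
    while True:
        v = node.version              # remember the epoch before reading
        value = node.value            # the optimistic (unprotected) access
        if v % 2 == 0 and node.version == v:
            return value              # node was not reclaimed mid-read; retry otherwise

def reclaim_and_reuse(node, new_value):
    with node._lock:                  # one reclaimer at a time
        node.version += 1             # odd: invalidate in-flight readers
        node.value = new_value
        node.version += 1             # even again: safe to read
```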
k-Center Clustering with Outliers in the MPC and Streaming Model
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-02-24 DOI: 10.1109/IPDPS54959.2023.00090
M. D. Berg, Leyla Biabani, M. Monemizadeh
Abstract: Given a point set $P \subseteq X$ of size $n$ in a metric space $(X, \mathrm{dist})$ of doubling dimension $d$ and two parameters $k \in \mathbb{N}$ and $z \in \mathbb{N}$, the $k$-center problem with $z$ outliers asks to return a set $\mathcal{C}^* = \{c_1^*, \dots, c_k^*\} \subseteq X$ of $k$ centers such that the maximum distance of all but $z$ points of $P$ to their nearest center in $\mathcal{C}^*$ is minimized. An $(\varepsilon, k, z)$-coreset for this problem is a weighted point set $P^*$ such that an optimal solution for the $k$-center problem with $z$ outliers on $P^*$ gives a $(1 \pm \varepsilon)$-approximation for the $k$-center problem with $z$ outliers on $P$. We study the construction of such coresets in the Massively Parallel Computing (MPC) model, and in the insertion-only as well as the fully dynamic streaming model. We obtain the following results, for any given $0 < \varepsilon \leq 1$; in all cases, the size of the computed coreset is $O(k/\varepsilon^d + z)$.
• In the MPC model the data are distributed over $m$ machines. One is the coordinator machine, which will contain the final answer; the others are worker machines. We present a deterministic 2-round algorithm using $O(\sqrt{n})$ machines, where the worker machines have $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z+1))$ local memory and the coordinator has $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z+1) + z)$ local memory. The algorithm can handle point sets $P$ that are distributed arbitrarily (possibly adversarially) over the machines. We also present a randomized algorithm that uses only a single round, under the assumption that the input set $P$ is initially distributed randomly over the machines. Finally, we present a deterministic algorithm that obtains a trade-off between the number of rounds, $R$, and the storage per machine.
• In the streaming model we have a single machine with limited storage, and $P$ is revealed in a streaming fashion.
○ We present the first lower bound for the insertion-only streaming model, where the points arrive one by one and no points are deleted: any deterministic algorithm that maintains an $(\varepsilon, k, z)$-coreset must use $\Omega(k/\varepsilon^d + z)$ space. We complement this with a deterministic streaming algorithm using $O(k/\varepsilon^d + z)$ space, which is thus optimal.
○ For fully dynamic data streams, where points can be inserted as well as deleted, we give a randomized algorithm for point sets from a $d$-dimensional discrete Euclidean space $[\Delta]^d$, where $\Delta \in \mathbb{N}$ indicates the size of the universe from which the coordinates are taken. Our algorithm uses only $O((k/\varepsilon^d + z)\log^4(k\Delta/(\varepsilon\delta)))$ space, and it is the first algorithm for this setting. We also present an $\Omega((k/\varepsilon^d)\log\Delta + z)$ lower bound for deterministic fully dynamic streaming algorithms.
○ For the sliding-window model, we show that any deterministic streaming algorithm that guarantees a $(1+\varepsilon)$-approximation for the $k$-center problem with outliers in $\mathbb{R}^d$ must use $\Omega((kz/\varepsilon^d)\log\sigma)$ space, where $\sigma$ is the ratio of the largest and smallest distance between any two points in the stream. This (negatively) answers a question posed by De Berg, Monemizadeh, …
Citations: 3
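For context on the underlying problem, the sketch below implements the classic greedy 3-approximation feasibility test for k-center with z outliers (in the style of Charikar et al.): for a guessed radius r, repeatedly pick the radius-r disk covering the most uncovered points and remove the expanded 3r disk around it. The paper's coreset, MPC, and streaming constructions target this problem but are not reproduced here.

```python
# Feasibility test for a guessed radius r: returns (centers, feasible).
# Binary-searching r over the pairwise distances yields a 3-approximation.
import numpy as np

def greedy_kcenter_outliers(points, r, k, z):
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    alive = np.ones(len(points), dtype=bool)        # points not yet covered
    centers = []
    for _ in range(k):
        if not alive.any():
            break
        counts = (dist[:, alive] <= r).sum(axis=1)  # uncovered points per disk
        c = int(np.argmax(counts))                  # densest radius-r disk
        centers.append(c)
        alive &= dist[c] > 3 * r                    # remove the expanded 3r disk
    return centers, int(alive.sum()) <= z           # feasible iff <= z points remain
```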
Engineering Massively Parallel MST Algorithms
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-02-23 DOI: 10.1109/IPDPS54959.2023.00075
P. Sanders, M. Schimek
Abstract: We develop and extensively evaluate highly scalable distributed-memory algorithms for computing minimum spanning trees (MSTs). At the heart of our solutions is a scalable variant of Borůvka's algorithm. For partitioned graphs with many local edges, we improve this with an effective form of contracting local parts of the graph during a preprocessing step. We also adapt the filtering concept of the best practical sequential algorithm to develop a massively parallel Filter-Borůvka algorithm that is very useful for graphs with poor locality and high average degree. Our experiments indicate that our algorithms scale well up to at least 65 536 cores and are up to 800 times faster than previous distributed MST algorithms.
Citations: 2
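A sequential sketch of the Borůvka step that the paper's distributed algorithms build on: each component selects its lightest incident edge, all selected edges join the MST, and the touched components are merged (contracted). The local-contraction preprocessing and the Filter-Borůvka variant are the paper's contributions and are not shown.

```python
# Sequential Borůvka: repeated lightest-outgoing-edge selection + contraction.
def boruvka_mst(n, edges):
    """edges: list of (weight, u, v) tuples; returns the MST edge list."""
    parent = list(range(n))

    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, components = [], n
    while components > 1:
        best = [None] * n               # lightest outgoing edge per component
        for e in edges:
            _, u, v = e
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                # internal edge: ignore
            for root in (ru, rv):
                if best[root] is None or e < best[root]:
                    best[root] = e
        merged = False
        for e in set(e for e in best if e is not None):
            _, u, v = e
            ru, rv = find(u), find(v)
            if ru != rv:                # contract the two components
                parent[ru] = rv
                mst.append(e)
                components -= 1
                merged = True
        if not merged:
            break                       # graph is disconnected
    return mst
```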
Engineering a Distributed-Memory Triangle Counting Algorithm
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2023-02-22 DOI: 10.1109/IPDPS54959.2023.00076
P. Sanders, Tim Niklas Uhl
Abstract: Counting the triangles in a graph, and those incident to each vertex, is a fundamental and frequently considered task of graph analysis. We consider how to do this efficiently for huge graphs using massively parallel distributed-memory machines. Unsurprisingly, the main issue is to reduce communication between processors. We achieve this by counting locally whenever possible and reducing the amount of information that needs to be sent in order to handle (possibly) nonlocal triangles. We also achieve linear memory requirements despite superlinear communication volume by introducing a new asynchronous sparse all-to-all operation. Furthermore, we dramatically reduce startup overheads by allowing this communication to use indirect routing. Our algorithms scale (at least) up to 32 768 cores and are up to 18 times faster than the previous state of the art.
Citations: 2
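The local counting kernel being distributed is the standard rank-oriented intersection count, sketched below for a single machine; orienting every edge from lower- to higher-ranked endpoint ensures each triangle is counted exactly once. The communication reduction and the asynchronous sparse all-to-all are the paper's contributions and are not shown.

```python
# Single-machine triangle counting by degree-ordered neighbor intersection.
from collections import defaultdict

def count_triangles(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Total order on vertices: by degree, ties broken by id.
    rank = {u: (len(adj[u]), u) for u in adj}
    # Orient each edge "upward" in the rank order.
    out = {u: {v for v in adj[u] if rank[v] > rank[u]} for u in adj}
    # Each triangle {a, b, c} with rank a < b < c is found once at edge (a, b).
    return sum(len(out[u] & out[v]) for u in out for v in out[u])
```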