Automatic Graph Partitioning for Very Large-scale Deep Learning
Masahiro Tanaka, K. Taura, T. Hanawa, Kentaro Torisawa
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00109

Abstract: This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such models do not fit into the memory of accelerator devices, they must be partitioned by model parallelism techniques. Moreover, to accelerate training on huge training data, we need a combination of model and data parallelism, i.e., hybrid parallelism. Given a model description for PyTorch without any specification for model parallelism, RaNNC automatically partitions the model into a set of subcomponents so that (1) each subcomponent fits into device memory and (2) a high training throughput for pipeline parallelism is achieved by balancing the computation times of the subcomponents. Since the search space for partitioning models can be extremely large, RaNNC partitions a model in three phases. First, it identifies atomic subcomponents using simple heuristic rules. Next, it groups them into coarser-grained blocks while balancing their computation times. Finally, it uses a novel dynamic-programming-based algorithm to efficiently search for combinations of blocks to determine the final partitions. In our experiments, we compared RaNNC with two popular frameworks, Megatron-LM (hybrid parallelism) and GPipe (originally proposed for model parallelism, but a version allowing hybrid parallelism also exists), for training models with increasingly greater numbers of parameters. In the pre-training of enlarged BERT models, RaNNC successfully trained models five times larger than those Megatron-LM could train, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models. RaNNC also achieved better training throughputs than GPipe on both the enlarged BERT model pre-training (GPipe with hybrid parallelism) and the enlarged ResNet models (GPipe with model parallelism) in all of the settings we tried. These results are remarkable, since RaNNC automatically partitions models without any modification to their descriptions; Megatron-LM and GPipe require users to manually rewrite the models' descriptions.
{"title":"Euler Meets GPU: Practical Graph Algorithms with Theoretical Guarantees","authors":"Adam Polak, Adrian Siwiec, Michal Stobierski","doi":"10.1109/IPDPS49936.2021.00032","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00032","url":null,"abstract":"The Euler tour technique is a classical tool for designing parallel graph algorithms, originally proposed for the PRAM model. We ask whether it can be adapted to run efficiently on GPU. We focus on two established applications of the technique: (1) the problem of finding lowest common ancestors (LCA) of pairs of nodes in trees, and (2) the problem of finding bridges in undirected graphs. In our experiments, we compare theoretically optimal algorithms using the Euler tour technique against simpler heuristics supposed to perform particularly well on typical instances. We show that the Euler tour-based algorithms not only fulfill their theoretical promises and outperform practical heuristics on hard instances, but also perform on par with them on easy instances.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129453436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending Sparse Tensor Accelerators to Support Multiple Compression Formats
Eric Qin, Geonhwa Jeong, William Won, Sheng-Chun Kao, Hyoukjun Kwon, S. Srinivasan, Dipankar Das, G. Moon, S. Rajamanickam, T. Krishna
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00110

Abstract: Sparsity, which occurs in both scientific applications and deep learning (DL) models, has been a key target of optimization in recent ASIC accelerators due to the potential memory and compute savings. These applications store data in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms they enable vary with tensor dimensions and the amount of sparsity. Since DL and scientific workloads span all sparsity regions, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on only one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates a ~4x speedup over performing format conversions in software.
{"title":"Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths","authors":"Edward Hutter, Edgar Solomonik","doi":"10.1109/IPDPS49936.2021.00014","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00014","url":null,"abstract":"The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configuration’s performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernel’s performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernel’s statistical profile. This strategy is effective in the presence of frequently-recurring computation and communication kernels, which is characteristic to algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate performance prediction accuracy obtained by our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133562200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Byzantine Agreement with Unknown Participants and Failures","authors":"P. Khanchandani, Roger Wattenhofer","doi":"10.1109/IPDPS49936.2021.00104","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00104","url":null,"abstract":"A set of mutually distrusting participants that want to agree on a common opinion must solve an instance of a Byzantine agreement problem. These problems have been extensively studied in the literature. However, most of the existing solutions assume that the participants are aware of n — the total number of participants in the system — and f — an upper bound on the number of Byzantine participants. In this paper, we show that most of the fundamental agreement problems can be solved without affecting resiliency even if the participants do not know the values of(possibly changing) n and f. Specifically, we consider a synchronous system where the participants have unique but not necessarily consecutive identifiers, and give Byzantine agreement algorithms for reliable broadcast, approximate agreement, rotor-coordinator, early terminating consensus and total ordering in static and dynamic systems, all with the optimal resiliency of n>3f. Moreover, we show that some synchrony is necessary as an agreement with probabilistic termination is impossible in a semi-synchronous or asynchronous system if the participants are unaware of n and f.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"224 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133350342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Byzantine Dispersion on Graphs","authors":"A. R. Molla, Kaushik Mondal, W. Moses","doi":"10.1109/IPDPS49936.2021.00103","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00103","url":null,"abstract":"This paper considers the problem of Byzantine dispersion and extends previous work along several parameters. The problem of Byzantine dispersion asks: given n robots, up to f of which are Byzantine, initially placed arbitrarily on an n node anonymous graph, design a terminating algorithm to be run by the robots such that they eventually reach a configuration where each node has at most one non-Byzantine robot on it. Previous work solved this problem for rings and tolerated up to n-1 Byzantine robots. In this paper, we investigate the problem on more general graphs. We first develop an algorithm that tolerates up to n-1 Byzantine robots and works for a more general class of graphs. We then develop an algorithm that works for any graph but tolerates a lesser number of Byzantine robots. We subsequently turn our focus to the strength of the Byzantine robots. Previous work considers only “weak” Byzantine robots that cannot fake their IDs. We develop an algorithm that solves the problem when Byzantine robots are not weak and can fake IDs. Finally, we study the situation where the number of the robots is not n but some k. We show that in such a scenario, the number of Byzantine robots that can be tolerated is severely restricted. Specifically, we show that it is impossible to deterministically solve Byzantine dispersion when $lceil k / nrceilgt lceil(k-f) / nrceil$.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126484556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Reinforcement Agent for Scheduling in HPC
Yuping Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, M. Papka
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00090

Abstract: The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs are allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) that leverages deep reinforcement learning. DRAS is built on a novel hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy enables DRAS to rapidly learn the target environment. Once provided a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as the workload changes. Experiments with different production workloads demonstrate that DRAS outperforms existing heuristic and optimization approaches by up to 45%.
High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers
Kamalavasan Kamalakkannan, G. Mudalige, I. Reguly, Suhaib A. Fahmy
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00117

Abstract: This paper presents a workflow for synthesizing near-optimal FPGA implementations of structured-mesh-based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers for high-performance computing applications. Key new features of the workflow are (1) the unification of standard state-of-the-art techniques with a number of high-gain optimizations such as batching and spatial blocking/tiling, motivated by increasing throughput for real-world workloads, and (2) the development and use of a predictive analytical model to explore the design space and obtain resource and performance estimates. Three representative applications are implemented using the design workflow on a Xilinx Alveo U280 FPGA, demonstrating near-optimal performance and over 85% predictive model accuracy. These are compared with equivalent highly optimized implementations of the same applications on modern HPC-grade GPUs (Nvidia V100), analyzing time to solution, bandwidth, and energy consumption. Performance results indicate runtimes comparable to the V100 GPU, with over 2x energy savings for the largest non-trivial application on the FPGA. Our investigation shows the challenges of achieving high performance on current-generation FPGAs compared to traditional architectures. We discuss the determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance.
{"title":"DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions","authors":"Yuke Wang, Boyuan Feng, Yufei Ding","doi":"10.1109/IPDPS49936.2021.00070","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00070","url":null,"abstract":"As the key advancement of the convolutional neural networks (CNNs), depthwise separable convolutions (DSCs) are becoming one of the most popular techniques to reduce the computations and parameters size of CNNs meanwhile maintaining the model accuracy. It also brings profound impact to improve the applicability of the compute- and memory-intensive CNNs to a broad range of applications, such as mobile devices, which are generally short of computation power and memory. However, previous research in DSCs are largely focusing on compositing the limited existing DSC designs, thus, missing the opportunities to explore more potential designs that can achieve better accuracy and higher computation/parameter reduction. Besides, the off-the-shelf convolution implementations offer limited computing schemes, therefore, lacking support for DSCs with different convolution patterns.To this end, we introduce, DSXplore, the first optimized design for exploring DSCs on CNNs. Specifically, at the algorithm level, DSXplore incorporates a novel factorized kernel–sliding-channel convolution (SCC), featured with input-channel overlapping to balance the accuracy performance and the reduction of computation and memory cost. SCC also offers enormous space for design exploration by introducing adjustable kernel parameters. Further, at the implementation level, we carry out an optimized GPU-implementation tailored for SCC by leveraging several key techniques, such as the input-centric backward design and the channel-cyclic optimization. Intensive experiments on different datasets across mainstream CNNs show the advantages of DSXplore in balancing accuracy and computation/parameter reduction over the standard convolution and the existing DSCs.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121000830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime
Alberto Parravicini, Arnaud Delamare, M. Arnaboldi, M. Santambrogio
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00020

Abstract: GPUs are readily available in cloud computing and personal devices, but their use for data processing acceleration has been slowed by their limited integration with common programming languages such as Python or Java. Moreover, using GPUs to their full capabilities requires expert knowledge of asynchronous programming. In this work, we present a novel GPU runtime scheduler for multi-task GPU computations that transparently provides asynchronous execution, space sharing, and transfer-computation overlap without requiring any information about the program's dependency structure in advance. We leverage the GrCUDA polyglot API to integrate our scheduler with multiple high-level languages and provide a platform for fast prototyping and easy GPU acceleration. We validate our work on 6 benchmarks created to evaluate task parallelism and show an average speedup of 44% over synchronous execution, with no execution-time slowdown compared to hand-optimized host code written using the C++ CUDA Graphs API.