Automatic Graph Partitioning for Very Large-scale Deep Learning
Masahiro Tanaka, K. Taura, T. Hanawa, Kentaro Torisawa
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00109

Abstract: This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such models do not fit into the memory of accelerator devices, they must be partitioned by model parallelism techniques. Moreover, to accelerate training on huge training data, we need a combination of model and data parallelism, i.e., hybrid parallelism. Given a model description for PyTorch without any specification for model parallelism, RaNNC automatically partitions the model into a set of subcomponents so that (1) each subcomponent fits into device memory and (2) a high training throughput for pipeline parallelism is achieved by balancing the computation times of the subcomponents. Since the search space for partitioning models can be extremely large, RaNNC partitions a model in three phases. First, it identifies atomic subcomponents using simple heuristic rules. Next, it groups them into coarser-grained blocks while balancing their computation times. Finally, it uses a novel dynamic-programming-based algorithm to efficiently search for combinations of blocks to determine the final partitions. In our experiments, we compared RaNNC with two popular frameworks, Megatron-LM (hybrid parallelism) and GPipe (originally proposed for model parallelism, but a version allowing hybrid parallelism also exists), for training models with increasingly greater numbers of parameters. In the pre-training of enlarged BERT models, RaNNC successfully trained models five times larger than those Megatron-LM could train, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models. RaNNC also achieved better training throughputs than GPipe on both the enlarged BERT model pre-training (GPipe with hybrid parallelism) and the enlarged ResNet models (GPipe with model parallelism) in all of the settings we tried. These results are remarkable, since RaNNC automatically partitions models without any modification to their descriptions; Megatron-LM and GPipe require users to manually rewrite the models' descriptions.
{"title":"Euler Meets GPU: Practical Graph Algorithms with Theoretical Guarantees","authors":"Adam Polak, Adrian Siwiec, Michal Stobierski","doi":"10.1109/IPDPS49936.2021.00032","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00032","url":null,"abstract":"The Euler tour technique is a classical tool for designing parallel graph algorithms, originally proposed for the PRAM model. We ask whether it can be adapted to run efficiently on GPU. We focus on two established applications of the technique: (1) the problem of finding lowest common ancestors (LCA) of pairs of nodes in trees, and (2) the problem of finding bridges in undirected graphs. In our experiments, we compare theoretically optimal algorithms using the Euler tour technique against simpler heuristics supposed to perform particularly well on typical instances. We show that the Euler tour-based algorithms not only fulfill their theoretical promises and outperform practical heuristics on hard instances, but also perform on par with them on easy instances.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129453436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending Sparse Tensor Accelerators to Support Multiple Compression Formats
Eric Qin, Geonhwa Jeong, William Won, Sheng-Chun Kao, Hyoukjun Kwon, S. Srinivasan, Dipankar Das, G. Moon, S. Rajamanickam, T. Krishna
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00110

Abstract: Sparsity, which occurs in both scientific applications and deep learning (DL) models, has been a key target of optimization in recent ASIC accelerators due to the potential memory and compute savings. These applications store data in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms they enable vary with tensor dimensions and the amount of sparsity. Since DL and scientific workloads span all sparsity regions, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on only one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates a ~4x speedup over performing format conversions in software.
{"title":"Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths","authors":"Edward Hutter, Edgar Solomonik","doi":"10.1109/IPDPS49936.2021.00014","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00014","url":null,"abstract":"The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configuration’s performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernel’s performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernel’s statistical profile. This strategy is effective in the presence of frequently-recurring computation and communication kernels, which is characteristic to algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate performance prediction accuracy obtained by our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133562200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Byzantine Agreement with Unknown Participants and Failures","authors":"P. Khanchandani, Roger Wattenhofer","doi":"10.1109/IPDPS49936.2021.00104","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00104","url":null,"abstract":"A set of mutually distrusting participants that want to agree on a common opinion must solve an instance of a Byzantine agreement problem. These problems have been extensively studied in the literature. However, most of the existing solutions assume that the participants are aware of n — the total number of participants in the system — and f — an upper bound on the number of Byzantine participants. In this paper, we show that most of the fundamental agreement problems can be solved without affecting resiliency even if the participants do not know the values of(possibly changing) n and f. Specifically, we consider a synchronous system where the participants have unique but not necessarily consecutive identifiers, and give Byzantine agreement algorithms for reliable broadcast, approximate agreement, rotor-coordinator, early terminating consensus and total ordering in static and dynamic systems, all with the optimal resiliency of n>3f. Moreover, we show that some synchrony is necessary as an agreement with probabilistic termination is impossible in a semi-synchronous or asynchronous system if the participants are unaware of n and f.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"224 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133350342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Byzantine Dispersion on Graphs","authors":"A. R. Molla, Kaushik Mondal, W. Moses","doi":"10.1109/IPDPS49936.2021.00103","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00103","url":null,"abstract":"This paper considers the problem of Byzantine dispersion and extends previous work along several parameters. The problem of Byzantine dispersion asks: given n robots, up to f of which are Byzantine, initially placed arbitrarily on an n node anonymous graph, design a terminating algorithm to be run by the robots such that they eventually reach a configuration where each node has at most one non-Byzantine robot on it. Previous work solved this problem for rings and tolerated up to n-1 Byzantine robots. In this paper, we investigate the problem on more general graphs. We first develop an algorithm that tolerates up to n-1 Byzantine robots and works for a more general class of graphs. We then develop an algorithm that works for any graph but tolerates a lesser number of Byzantine robots. We subsequently turn our focus to the strength of the Byzantine robots. Previous work considers only “weak” Byzantine robots that cannot fake their IDs. We develop an algorithm that solves the problem when Byzantine robots are not weak and can fake IDs. Finally, we study the situation where the number of the robots is not n but some k. We show that in such a scenario, the number of Byzantine robots that can be tolerated is severely restricted. Specifically, we show that it is impossible to deterministically solve Byzantine dispersion when $lceil k / nrceilgt lceil(k-f) / nrceil$.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126484556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Reinforcement Agent for Scheduling in HPC
Yuping Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, M. Papka
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00090

Abstract: The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs are allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) that leverages deep reinforcement learning. DRAS is built on a novel hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy enables DRAS to rapidly learn the target environment. Once provided a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as the workload changes. Experiments with different production workloads demonstrate that DRAS outperforms existing heuristic and optimization approaches by up to 45%.
High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers
Kamalavasan Kamalakkannan, G. Mudalige, I. Reguly, Suhaib A. Fahmy
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00117

Abstract: This paper presents a workflow for synthesizing near-optimal FPGA implementations of structured-mesh-based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers for high-performance computing applications. Key new features of the workflow are (1) the unification of standard state-of-the-art techniques with a number of high-gain optimizations such as batching and spatial blocking/tiling, motivated by increasing throughput for real-world workloads, and (2) the development and use of a predictive analytical model to explore the design space and obtain resource and performance estimates. Three representative applications are implemented using the design workflow on a Xilinx Alveo U280 FPGA, demonstrating near-optimal performance and over 85% predictive model accuracy. These are compared with equivalent highly optimized implementations of the same applications on modern HPC-grade GPUs (Nvidia V100), analyzing time to solution, bandwidth, and energy consumption. Performance results indicate runtimes comparable to the V100 GPU, with over 2x energy savings for the largest non-trivial application on the FPGA. Our investigation shows the challenges of achieving high performance on current-generation FPGAs compared to traditional architectures. We discuss the determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance.
{"title":"DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions","authors":"Yuke Wang, Boyuan Feng, Yufei Ding","doi":"10.1109/IPDPS49936.2021.00070","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00070","url":null,"abstract":"As the key advancement of the convolutional neural networks (CNNs), depthwise separable convolutions (DSCs) are becoming one of the most popular techniques to reduce the computations and parameters size of CNNs meanwhile maintaining the model accuracy. It also brings profound impact to improve the applicability of the compute- and memory-intensive CNNs to a broad range of applications, such as mobile devices, which are generally short of computation power and memory. However, previous research in DSCs are largely focusing on compositing the limited existing DSC designs, thus, missing the opportunities to explore more potential designs that can achieve better accuracy and higher computation/parameter reduction. Besides, the off-the-shelf convolution implementations offer limited computing schemes, therefore, lacking support for DSCs with different convolution patterns.To this end, we introduce, DSXplore, the first optimized design for exploring DSCs on CNNs. Specifically, at the algorithm level, DSXplore incorporates a novel factorized kernel–sliding-channel convolution (SCC), featured with input-channel overlapping to balance the accuracy performance and the reduction of computation and memory cost. SCC also offers enormous space for design exploration by introducing adjustable kernel parameters. Further, at the implementation level, we carry out an optimized GPU-implementation tailored for SCC by leveraging several key techniques, such as the input-centric backward design and the channel-cyclic optimization. Intensive experiments on different datasets across mainstream CNNs show the advantages of DSXplore in balancing accuracy and computation/parameter reduction over the standard convolution and the existing DSCs.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121000830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime
Alberto Parravicini, Arnaud Delamare, M. Arnaboldi, M. Santambrogio
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). DOI: https://doi.org/10.1109/IPDPS49936.2021.00020

Abstract: GPUs are readily available in cloud computing and personal devices, but their use for data processing acceleration has been slowed by their limited integration with common programming languages such as Python or Java. Moreover, using GPUs to their full capabilities requires expert knowledge of asynchronous programming. In this work, we present a novel GPU runtime scheduler for multi-task GPU computations that transparently provides asynchronous execution, space sharing, and transfer-computation overlap without requiring any information about the program's dependency structure in advance. We leverage the GrCUDA polyglot API to integrate our scheduler with multiple high-level languages and provide a platform for fast prototyping and easy GPU acceleration. We validate our work on 6 benchmarks created to evaluate task parallelism and show an average speedup of 44% over synchronous execution, with no execution-time slowdown compared to hand-optimized host code written using the C++ CUDA Graphs API.