P. Rajak, Anikeya Aditya, S. Fukushima, R. Kalia, Thomas M Linker, Kuang Liu, Ye Luo, A. Nakano, K. Nomura, K. Shimamura, F. Shimojo, P. Vashishta
{"title":"Ex-NNQMD: Extreme-Scale Neural Network Quantum Molecular Dynamics","authors":"P. Rajak, Anikeya Aditya, S. Fukushima, R. Kalia, Thomas M Linker, Kuang Liu, Ye Luo, A. Nakano, K. Nomura, K. Shimamura, F. Shimojo, P. Vashishta","doi":"10.1109/IPDPSW52791.2021.00145","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00145","url":null,"abstract":"Deep learning is revolutionizing countless scientific and engineering fields. In particular, the SC20 Gordon Bell award represented a breakthrough in molecular simulation, i.e., a 100-million-atom simulation with quantum-mechanical accuracy on the Summit supercomputer at ORNL, using deep potential molecular dynamics (MD). While these simulations were performed only in gentle equilibrium conditions, far-from-equilibrium MD simulation involving light-induced electronic excited states finds numerous scientific and engineering applications. However, it remains a challenge to perform such far-from-equilibrium simulations at larger spatiotemporal scales, where a growing number of unphysical predictions of interatomic forces prohibits simulations involving larger numbers of atoms for longer times. In this paper, we propose a physically-based inductive bias, maximally-preserved Maxwell-Boltzmann (MPMB), to overcome this fidelity-scaling problem. Along with hybrid divide-and-conquer parallelization and single-node level optimization using multithreading and data-parallel SIMD, the resulting Ex-NNQMD (extreme-scale neural network quantum molecular dynamics) algorithm has achieved unprecedented scales of far-from-equilibrium simulations: 1) a 5.1-billion-atom system with a parallel efficiency of 0.94, and 2) a sustained performance of 6.4 nanoseconds/day for a 10-million-atom system, both on 262,144 cores of the Theta supercomputer at Argonne Leadership Computing Facility. 
Extended fidelity scaling and efficient parallelization have allowed us for the first time to study light-induced ferroelectric switching under extreme electronic excitation at experimentally relevant spatiotemporal scales with accuracy.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127654814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Barrachina, Adrián Castelló, M. Catalán, M. F. Dolz, José I. Mestre
{"title":"A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks","authors":"S. Barrachina, Adrián Castelló, M. Catalán, M. F. Dolz, José I. Mestre","doi":"10.1109/IPDPSW52791.2021.00110","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00110","url":null,"abstract":"We present PyDTNN, a framework for training deep neural networks (DNNs) on clusters of computers that has been designed as a research-oriented tool with a low learning curve. Our parallel training framework offers a set of functionalities that cover several must-have features for advanced deep learning (DL) software: 1) it is developed in Python in order to expose an accessible entry point for the newcomer; 2) it is extensible, allowing users to prototype new research ideas without requiring them to deal with complex software stacks; and 3) it delivers high parallel performance, exploiting MPI via mpi4py/NCCL for communication, and NumPy, cuDNN, and cuBLAS for computation. This paper provides practical evidence that PyDTNN attains similar accuracy and parallel performance to those exhibited by Google’s TensorFlow (TF), though we recognize that PyDTNN cannot compete with a production-level framework such as TF or PyTorch in terms of maturity and functionality. 
Instead, PyDTNN is designed as an accessible and customizable tool for prototyping ideas related to distributed training of DNN models on clusters.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115964576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ke Fan, Kristopher K. Micinski, Thomas Gilray, Sidharth Kumar
{"title":"Exploring MPI Collective I/O and File-per-process I/O for Checkpointing a Logical Inference Task","authors":"Ke Fan, Kristopher K. Micinski, Thomas Gilray, Sidharth Kumar","doi":"10.1109/IPDPSW52791.2021.00153","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00153","url":null,"abstract":"We present a scalable parallel I/O system for a logical-inferencing application built atop a deductive database. Deductive databases can make logical deductions (i.e., conclude additional facts), based on a set of program rules, derived from facts already in the database. Datalog is a language or family of languages commonly used to specify rules and queries for a deductive database. Applications built using Datalog can range from graph mining (such as computing transitive closure or k-cliques) to program analysis (control and data-flow analysis). In our previous papers, we presented the first implementation of a data-parallel Datalog built using MPI. In this paper, we present a parallel I/O system used to checkpoint and restart applications built on top of our Datalog system. State-of-the-art Datalog implementations, such as Soufflé, only support serial I/O, mainly because the implementation itself does not support many-node parallel execution. Computing the transitive closure of a graph is one of the simplest logical-inferencing applications built using Datalog; we use it as a micro-benchmark to demonstrate the efficacy of our parallel I/O system. Internally, we use a nested B-tree data structure to facilitate fast and efficient in-memory access to relational data. Our I/O system therefore involves two steps: converting the application data layout (a nested B-tree) to a stream of bytes, followed by the actual parallel I/O. We explore two popular I/O techniques: POSIX I/O and MPI collective I/O. To extract performance from MPI collective I/O, we use adaptive striping, and for POSIX I/O we use file-per-process I/O. 
We demonstrate the scalability of our system at up to 4,096 processes on the Theta supercomputer at the Argonne National Laboratory.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127480369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Approach for Image Border Handling on GPUs via Iteration Space Partitioning","authors":"Bo Qiao, J. Teich, Frank Hannig","doi":"10.1109/IPDPSW52791.2021.00067","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00067","url":null,"abstract":"Border handling is a crucial step in many image processing applications. For stencil kernels such as the Gaussian filter where a window of pixels is required to compute an output pixel, the border of the image needs to be handled differently than the body of the image. To prevent out-of-bounds accesses, conditional statements need to be inserted into the pixel address calculation. This introduces significant overhead, especially on hardware accelerators such as GPUs. Existing research efforts mostly focus on image body computations, while neglecting the importance of border handling or treating it as a corner case. In this paper, we propose an efficient border handling approach for GPUs. Our approach is based on iteration space partitioning, which is a technique similar to index-set splitting, a well-known general-purpose compiler optimization. We present a detailed systematic analysis including an analytic model that quantitatively evaluates the benefits as well as the costs of the transformation. In addition, manually implementing the border handling technique is a tedious task and not portable at all. We integrate our approach into an image processing DSL and a source-to-source compiler called Hipacc to relieve the burden and increase programmers’ productivity. We evaluate over five commonly used image processing applications on two Nvidia GPUs. 
Results show our proposed approach achieves a geometric mean speedup of up to 87% over a naive implementation.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133781445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Petros Anastasiadis, Sergiy Gogolenko, Nikela Papadopoulou, M. Lawenda, H. Arabnejad, Ali Jahani, Imran Mahmood, D. Groen
{"title":"P-Flee: An Efficient Parallel Algorithm for Simulating Human Migration","authors":"Petros Anastasiadis, Sergiy Gogolenko, Nikela Papadopoulou, M. Lawenda, H. Arabnejad, Ali Jahani, Imran Mahmood, D. Groen","doi":"10.1109/IPDPSW52791.2021.00159","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00159","url":null,"abstract":"With over 79 million people forcibly displaced, forced human migration has become a common issue in the modern world and a serious challenge for the global community. Flee is a validated agent-based social simulation (ABSS) framework for forecasting population displacements in armed conflict settings. In this paper, we present two schemes to parallelize Flee, analyze the computational complexity of those schemes, and outline benchmark results for our parallel codes with real-world and synthetic scenarios on four state-of-the-art systems, including a new European pre-exascale system, Hawk. On all testbeds, we observed high scalability of our codes, which extends beyond 16,384 cores in our largest benchmark with 100 million agents on Hawk. The parallelization schemes discussed in this work can be extrapolated to a wide range of ABSS applications with frequent agent movement and a lesser impact of direct communications between agents.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134153924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A self-stabilizing token circulation with graceful handover on bidirectional ring networks","authors":"H. Kakugawa, S. Kamei","doi":"10.1109/IPDPSW52791.2021.00093","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00093","url":null,"abstract":"In self-organizing distributed systems in which there is no centralized controller, cooperation of processes and fault tolerance are crucial. The former can be formalized by process synchronization, which is one of the fundamental problems in concurrent, parallel and distributed computing. The latter can be formalized by self-stabilization. Self-stabilizing distributed algorithms are a class of fault-tolerant distributed algorithms that tolerate a finite number of transient faults of any kind. Such an algorithm can be considered a self-organizing system because it needs neither a globally synchronized initialization nor a reset, and the system automatically converges to a legitimate configuration. In this paper, we propose a self-stabilizing distributed algorithm for a token ring with graceful handover on a bidirectional ring network under the message-passing communication model. The motivation of this work is to design a protocol, by a formal approach, which is useful for a self-organizing multi-node security camera system that guarantees continuous observation. More specifically, the system consists of several nodes, each of which is equipped with a video camera; some of the nodes are active in monitoring, and others are inactive to save energy. The problem is to design an algorithm with graceful handover of active nodes. That is, at least one node is active at any time; in other words, there is no time instant at which no node is active. This problem is formalized as the mutual inclusion problem, which is a process synchronization problem such that at least one process is in the critical section. 
To this end, we propose an algorithm for circulating two tokens on a bidirectional ring network with the locally shared memory model by extending Dijkstra’s self-stabilizing token ring. We also propose the concept of the model gap tolerance property for graceful handover. The proposed algorithm is self-stabilizing, and it guarantees graceful handover in a message-passing distributed system.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114951669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. G. Pinto, Lucas Leandro Nesi, M. Miletto, L. Schnorr
{"title":"Providing In-depth Performance Analysis for Heterogeneous Task-based Applications with StarVZ","authors":"V. G. Pinto, Lucas Leandro Nesi, M. Miletto, L. Schnorr","doi":"10.1109/IPDPSW52791.2021.00013","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00013","url":null,"abstract":"Task-based parallelism has adequately addressed the coding complexity required to fully exploit the processing power offered by omnipresent hybrid CPU/GPU supercomputers. However, its performance highly depends on the proper runtime system setup. Analyzing and tuning the performance of task-based applications running on hybrid platforms is challenging since they present unstructured communication and computation overlap, with finer granularity, dynamic scheduling, and inherent irregularity. This paper discusses the StarVZ approach to enable a comprehensive performance analysis in such a heterogeneous context. StarVZ is built on top of modern data analysis tools and is publicly available as an R package. We collect traces from five diverse task-based applications running on top of the StarPU runtime system on a set of multi-node platforms enhanced with GPUs. We demonstrate how it can highlight disturbances that are particularly hard to identify or explain with traditional analysis tools. 
Additionally, we provide a detailed performance evaluation of StarVZ with different workloads and setups.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115471984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to iWAPT 2021","authors":"","doi":"10.1109/ipdpsw52791.2021.00112","DOIUrl":"https://doi.org/10.1109/ipdpsw52791.2021.00112","url":null,"abstract":"","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115472296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Performance Prediction of Irregular Workloads in Multi-Phase Particle-in-Cell Applications","authors":"Sai P. Chenna, H. Lam, G. Stitt, S. Balachandar","doi":"10.1109/IPDPSW52791.2021.00120","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00120","url":null,"abstract":"The demand for reliable performance prediction of large-scale systems is ever-increasing. With the constant need of application users for faster execution and the expectation on system administrators to efficiently allocate system resources, reliable performance prediction frameworks are crucial for identifying scalability bottlenecks which result in suboptimal performance and poor resource utilization. Such challenges in scalable performance prediction are further exacerbated by irregular applications which present dynamic workload fluctuations across processors. In this paper, we propose a novel trace-driven performance prediction framework to reliably predict the performance of a class of irregular applications that employs the Particle-in-Cell (PIC) method. The framework provides multiple advantages in terms of scalability prediction, algorithm evaluation, and performance tuning. To demonstrate scalability prediction, we predicted the performance of CMT-nek, a large-scale scientific application which employs the PIC method, on Quartz (a DOE HPC system) with an average Mean Absolute Percentage Error (MAPE) of 8.42%. For algorithm evaluation, we evaluated the efficiency of two candidate particle mapping algorithms used in CMT-nek. 
For performance tuning, we performed a parameter study to assess the impact of a key problem parameter in CMT-nek on application performance.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"344 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116446877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Nakano, Shotaro Aoki, Yasuaki Ito, Akihiko Kasagi
{"title":"On the Computational Power of Convolution Pooling: A Theoretical Approach for Deep Learning","authors":"K. Nakano, Shotaro Aoki, Yasuaki Ito, Akihiko Kasagi","doi":"10.1109/IPDPSW52791.2021.00100","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00100","url":null,"abstract":"Convolutional neural networks (CNNs) have been widely used for image analysis and recognition. For example, LeNet-5 is a 7-layer convolutional neural network, which can attain more than 99% test accuracy for classification of handwritten digits. CNNs repeat convolution and pooling operations alternately. However, the computational capability of such operations is not clear. We are interested in identifying a class of problems that can be solved by CNNs. As a formal approach for this task, we introduce a theoretical parallel computational model of CNNs that we call the convolution-pooling machine. It captures the essence of convolution and pooling operations, and the application of non-linear activation functions performed in CNNs. In this paper, we assume the convolution-pooling machine operating on 1-dimensional arrays for simplicity, and focus on the problem of classification of inputs by the distance between two feature points. More specifically, we will design a convolution-pooling machine solving the problem Dk (k≥1), the problem of determining whether the distance between the two 1’s is at most k. To design the convolution-pooling machine solving the problem Dk, we generate a mixed-integer linear programming (MILP) problem with constraints and objective functions. We have solved the generated MILP problem for each Dk (1≤k≤128) using the Gurobi optimizer, a commercial MILP solver. We succeeded in finding a solution for all Dk (1≤k≤128) and designing the convolution-pooling machine for solving them. 
This fact indicates that convolution and pooling operations in CNNs may have the computational capability of classification by the distance of feature points.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122572261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}