{"title":"“Crosscutting Themes in Computer Science: Where Does PDC Education Fit?”","authors":"R. Raj","doi":"10.1109/IPDPSW55747.2022.00063","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00063","url":null,"abstract":"Since 1968, ACM and IEEE Computer Society have jointly led the development of curricular guidelines in various computing disciplines, starting with Computer Science (CS). The last major release of the undergraduate CS Curriculum Guidelines (CS2013) recognized 18 knowledge areas underpinning the discipline; the next decennial release is also likely to have the same number of knowledge areas. Viewing these knowledge areas as distinct silos does disservice to their interconnected nature, especially as crosscutting or recurring themes run across them and help to unify fundamental concepts in the CS discipline. In this talk, I will discuss crosscutting themes as providing an orthogonal view of the CS discipline, a view girded by knowledge and experience gained over the past 50 years. Providing explicit instruction in the presence and variety of crosscutting themes in CS will help students see each area not just in a silo of insular ideas, but also as part of the ethos of the discipline. I will use examples from the different knowledge areas to show where Parallel and Distributed Computing could fit into CS.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122734933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Throughput-oriented and Accuracy-aware DNN Training with BFloat16 on GPU","authors":"Zhen Xie, Siddhisanket Raskar, M. Emani","doi":"10.1109/IPDPSW55747.2022.00176","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00176","url":null,"abstract":"Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and achieved extraordinary success in many areas. The training of DNNs is commonly compute and memory-intensive, which has resulted in several optimizations in the training phase. Among them, reduced precision is a typical and widely used technique to accelerate DNN training and reduce memory requirements. However, applying a widely adopted reduced precision format such as Float16 to all involved operations in DNN training is not optimal as the use of Float16 in some operations can hurt model accuracy. Meanwhile, additional optimizations including loss scaling and autocast techniques can mitigate the accuracy loss but lead to inherent overhead and inadequate use of reduced precision. In this work, we leverage another reduced precision format, BFloat16, and introduce a throughput-oriented and accuracy-aware approach to maximize the performance potential of DNN training. Since the high throughput provided by BFloat16 format is accompanied by low precision of the floating-point representation, this approach achieves high throughput by using BFloat16 on all DNN op-erations and avoids the accuracy loss through a customized accuracy-aware normalization. Results show that our approach outperforms the state-of-the-art mixed precision training by 1.21x on an NVIDIA A100 GPU.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123294753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Java-based HPC using the MVAPICH2 Library: Early Experiences","authors":"Kinan Al-Attar, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW55747.2022.00091","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00091","url":null,"abstract":"There has been sporadic interest in using Java for High Performance Computing (HPC) in the past. These earlier efforts have resulted in several Java Message Passing Interface (MPI) [1] libraries including mpiJava [2], FastMPJ [3], MPJ Express [4], and Java Open MPI [5]. In this paper, we present our efforts in designing and implementing Java bindings for the MVAPICH2 [6] library. The MVAPICH2 Java bindings (MVAPICH2-J) follow the same API as the Java Open MPI library. MVAPICH2-J also provides support for communicating direct New I/O (NIO) ByteBuffers and Java arrays. Direct ByteBuffers reside outside JVM heaps and are not subject to the garbage collection. The library implements and utilizes a buffering layer to explicitly manage memory to avoid creating buffers every time a Java array message is communicated. In order to evaluate the performance of MVAPICH2-J and other Java MPI libraries, we also designed and implemented OMB-J that is a Java extension to the popular OSU Micro-Benchmarks suite (OMB) [7]. OMB-J currently supports a range of bench-marks for evaluating point-to-point and collective communication primitives. We also added support for communicating direct ByteBuffers and Java arrays. Our evaluations reveal that at the OMB-J level, ByteBuffers are superior in performance due to the elimination of extra copying between the Java and the Java Native Interface (JNI) layer. MVAPICH2-J achieves similar performance to Java Open MPI for ByteBuffers in point-to-point communication primitives that is evaluated using latency and bandwidth benchmarks. For Java arrays, there is a slight overhead for MVAPICH2-J due to the use of the buffering layer. For the collective communication benchmarks, we observe good performance for MVAPICH2-J. Where, MVAPICH2-J fairs better than Java Open MPI with ByteBuffers by $a$ factor of 6.2 and 2.76 for broadcast and all reduce, respectively, on average for all messages sizes. And, using Java arrays, $2. 2times$ and $1. 62times$ on average for broadcast and allreduce, respectively. The collective communication performance is dictated by the performance of the respective native MPI libraries.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123480112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fully Dynamic Line Maintenance by Hybrid Programmable Matter","authors":"Nooshin Nokhanji, P. Flocchini, N. Santoro","doi":"10.1109/IPDPSW55747.2022.00087","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00087","url":null,"abstract":"Motivated by the manipulation of nanoscale materials, recent investigations have focused on hybrid systems where passive elements incapable of movement, called tiles, are manipulated by one or more mobile entities, called robots, with limited computational capabilities. Like in most self-organizing systems, the fundamental concern is with the (geometric) shapes created by the position of the tiles; among them, the line is perhaps the most important. The existing investigations have focused on formation of the shape, but not on its reconfiguration following the failure of some of the tiles. In this paper, we study the problem of maintaining a line formation in presence of fully dynamic failures: any tile can stop functioning at any time. We show how this problem can be solved by a group of very simple robots, with the computational power of deterministic finite automata.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125559924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reproducibility of Bioinformatics Tools","authors":"P. Baykal, N. Beerenwinkel, S. Mangul","doi":"10.1109/IPDPSW55747.2022.00046","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00046","url":null,"abstract":"We introduce a fast and scalable method to assess reproducibility of bioinformatics tools. We replace replicates which are cause of data variation by synthetic replicates. To assess reproducibility of bioinformatics tools, we run the tools with two different types of synthetic replicates and compare results obtained from the original data. Results show differences between output obtained from original data and synthetic replicates.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131955332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Uncore Frequency and Dynamic Power Capping to Improve Power Savings","authors":"Amina Guermouche","doi":"10.1109/IPDPSW55747.2022.00164","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00164","url":null,"abstract":"The US Department of Energy sets a limit of 20 to 30 MW for future exascale machines. In order to control their power consumption, modern processors provide many features. Power capping and uncore frequency scaling are examples of such features which allow to limit the power consumed by a processor. In this paper, we propose to combine dynamic power capping to uncore frequency scaling. We propose DUFP, an extension of DUF, an existing tool which dynamically adapts uncore frequency. DUFP dynamically adapts the processor power cap to the application needs. Finally, just like DUF, DUFP can tolerate performance loss up to a user-defined limit. With a controlled impact on performance, DUFP is able to provide power savings with no energy consumption increase. The evaluation of DUFP shows that it manages to stay within the user-defined slowdown limits for most of the studied applications. Moreover, combining uncore frequency scaling to power capping: (i) improves power consumption by up to 13.98 % with additional energy savings for applications where uncore frequency scaling has a limited impact, (ii) improves power consumption by up to 7.90 % compared to using uncore frequency scaling by itself and (iii) leads to more than 5 % power savings at 5 % tolerated slowdown with no energy consumption increase for most applications.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122334869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Unified Memory Performance in HIP","authors":"Zheming Jin, J. Vetter","doi":"10.1109/IPDPSW55747.2022.00096","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00096","url":null,"abstract":"Heterogeneous unified memory management between a CPU and a GPU is a major challenge in GPU computing. Recently, unified memory (UM) has been supported by software and hardware components on AMD computing platforms. The support could simplify the complexities of memory management. In this paper, we attempt to have a better understanding of UM by evaluating the performance of UM programs on an AMD MI100 GPU. More specifically, we evaluate data migration using UM against other data transfer techniques for the overall performance of an application, assess the impacts of three commonly used optimization techniques on the kernel execution time of a vector add sample, and compare the performance and productivity of selected benchmarks with and without UM. The performance overhead associated with UM is not trivial, but it can improve programming productivity by reducing lines of code for scientific applications. We aim to present early results and feedback on the UM performance to the vendor.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133866421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerated LD-based selective sweep detection using GPUs and FPGAs","authors":"Reinout Corts, Niek Sterenborg, Nikolaos S. Alachiotis","doi":"10.1109/IPDPSW55747.2022.00044","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00044","url":null,"abstract":"Selective sweep detection carries theoretical significance and has several practical implications, from explaining the adaptive evolution of a species in an environment to understanding the emergence of viruses from animals, such as SARS-CoV-2, and their transmission from human to human. The plethora of available genomic data for population genetic analyses, however, poses various computational challenges to existing methods and tools, leading to prohibitively long analysis times. In this work, we accelerate LD (Linkage Disequilibrium) - based selective sweep detection using GPUs and FPGAs on personal computers and datacenter infrastructures. LD has been previously efficiently accelerated with both GPUs and FPGAs. However, LD alone cannot serve as an indicator of selective sweeps. Here, we complement previous research with dedicated accelerators for the ω statistic, which is a direct indicator of a selective sweep. We evaluate performance of our accelerator solutions for computing the $w$ statistic and for a complete sweep detection method, as implemented by the open-source software OmegaPlus. In comparison with a single CPU core, the FPGA accelerator delivers up to 57.1× and 61.7× faster computation of the ω statistic and the complete sweep detection analysis, respectively. The respective attained speedups by the GPU-accelerated version of OmegaPlus are 2.9× and 12.9×. The GPU-accelerated implementation is available for download here: https://github.com/MrKzn/omegaplus.git.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134059060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Batch Parallel Algorithms for Updating PageRank","authors":"Subhajit Sahu, Kishore Kothapalli, D. Banerjee","doi":"10.1109/IPDPSW55747.2022.00186","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00186","url":null,"abstract":"The design and implementation of parallel algorithms for dynamic graph problems is attracting significant research attention in the recent years, driven by numerous applications to social network analysis, neuroscience, and protein interaction networks. One such problem is the computation of PageRank values of vertices in a directed graph. This paper presents two new parallel algorithms for recomputing the PageRank values of vertices in a dynamic graph. Our techniques require the recomputation of the PageRank of only the vertices affected by the insertion/deletion of a batch of edges. We conduct detailed experimental studies of our algorithm on a set of 11 real-world graphs. Our results on Intel Xeon Silver 4116 CPU and NVIDIA Tesla V100 PCIe 16GB GPU indicate that our algorithms outperform static and dynamic update algorithms by $6.1times$: and $8.6times mathbf{on}$ the CPU, and by 9.8×and $9.3timesmathbf{on}$ the GPU respectively. We also compare the performance of the algorithms in batched mode to cumulative single-edge updates.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132045973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}