{"title":"Practice and Experience in using Parallel and Scalable Machine Learning with Heterogenous Modular Supercomputing Architectures","authors":"M. Riedel, Rocco Sedona, C. Barakat, Pétur Helgi Einarsson, R. Hassanian, Gabriele Cavallaro, Matthias Book, Helmut Neukirchen, A. Lintermann","doi":"10.1109/IPDPSW52791.2021.00019","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00019","url":null,"abstract":"We observe a continuously increased use of Deep Learning (DL) as a specific type of Machine Learning (ML) for data-intensive problems (i.e., ’big data’) that requires powerful computing resources with equally increasing performance. Consequently, innovative heterogeneous High-Performance Computing (HPC) systems based on multi-core CPUs and many-core GPUs require an architectural design that addresses end user communities’ requirements that take advantage of ML and DL. Still the workloads of end user communities of the simulation sciences (e.g., using numerical methods based on known physical laws) needs to be equally supported in those architectures. This paper offers insights into the Modular Supercomputer Architecture (MSA) developed in the Dynamic Exascale Entry Platform (DEEP) series of projects to address the requirements of both simulation sciences and data-intensive sciences such as High Performance Data Analytics (HPDA). It shares insights into implementing the MSA in the Jülich Supercomputing Centre (JSC) hosting Europe No. 1 Supercomputer Jülich Wizard for European Leadership Science (JUWELS). We augment the technical findings with experience and lessons learned from two application communities case studies (i.e., remote sensing and health sciences) using the MSA with JUWELS and the DEEP systems in practice. Thus, the paper provides details into specific MSA design elements that enable significant performance improvements of ML and DL algorithms. While this paper focuses on MSA-based HPC systems and application experience, we are not losing sight of advances in Cloud Computing (CC) and Quantum Computing (QC) relevant for ML and DL.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127469879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Cryptanalytic Applications with Stochastic Runtimes on GPUs","authors":"Lena Oden, J. Keller","doi":"10.1109/IPDPSW52791.2021.00077","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00077","url":null,"abstract":"We investigate cryptanalytic applications comprised of many independent tasks that exhibit a stochastic runtime distribution. We compare four algorithms for executing such applications on GPUs. We demonstrate that for different distributions, problem sizes, and platforms the best strategy varies. We support our analytic results by extensive experiments on two different GPUs, from different sides of the performance spectrum: A high performance GPU (Nvidia Volta) and an energy saving system on chip (Jetson Nano).","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125437367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EduPar Virtual Poster Session","authors":"Jesús Cámara, José-Carlos Cano, J. Cuenca, Toshiyuki Maeda, Mariano Saura-Sánchez, Lewis Tseng, A. Wakatani, Martina Barnas","doi":"10.1109/IPDPSW52791.2021.00060","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00060","url":null,"abstract":"This paper provides an overview of posters accepted for the EduPar 21 poster session. Poster sessions have been an important part of the EduPar workshops, providing an opportunity to facilitate interactions and fostering the community. After a hiatus caused by the COVID-19 pandemic we decided to resume the poster session tradition and to hold the first virtual poster session in EduPar’s eleven years history.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132982868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching Complex Scheduling Algorithms","authors":"S. Hunold, Bartłomiej Przybylski","doi":"10.1109/IPDPSW52791.2021.00058","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00058","url":null,"abstract":"We introduce Scheduling.jland show how it can be used for teaching the basics of scheduling theory to Computer Science students. In particular, our course focuses on scheduling algorithms for parallel, identical machines. For these problems, approximation algorithms and approximation schemes exist. However, we believe that students better understand advantages as well as disadvantages of these approximation algorithms when they investigate their implementations and examine how the algorithms work in practice. For that purpose, we have implemented a set of heuristics and approximation algorithms on top of Scheduling.jl. In the present article, we go through some of the implemented algorithms and explain why we believe these algorithms are particularly helpful for students to understand the basic concepts of approximation algorithms. In our experience, students remember algorithmic details much better if we show them examples using Scheduling.jl.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"293 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133097217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GYAN: Accelerating Bioinformatics Tools in Galaxy with GPU-Aware Computation Mapping","authors":"Gulsum Gudukbay, J. Gunasekaran, Yilin Feng, M. Kandemir, A. Nekrutenko, C. Das, P. Medvedev, B. Grüning, Nate Coraor, Nathan P Roach, E. Afgan","doi":"10.1109/IPDPSW52791.2021.00037","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00037","url":null,"abstract":"Galaxy is an open-source web-based framework that is widely used for performing computational analyses in diverse application domains, such as genome assembly, computational chemistry, ecology, and epigenetics, to name a few. The current Galaxy software framework runs on several high-performance computing platforms such as on-premise clusters, public data centers, and national lab supercomputers. These infrastructures also provide support for state-of-the-art accelerators like Graphical Processing Units (GPUs). When coupled with accelerator support, the tools executing in Galaxy can benefit from massive performance gains in terms of computation time, thereby allowing a more robust computational analysis environment for researchers. Despite tools having GPU capabilities, the current Galaxy framework does not support GPUs, and thus prevents tools from taking advantage of the performance benefits offered by GPUs. We present and experimentally evaluate GYAN, a GPU-aware computation mapping and orchestration functionality implemented in Galaxy that allows the Galaxy tools to be executed on a GPU-enabled cluster. GYAN has the capability of identifying GPU-supported tools and scheduling them on single or multiple GPU nodes based on the availability in the cluster. GYAN supports both native and containerized tool execution. We performed extensive evaluations of the implementation using popular bio-engineering tools to demonstrate the benefits of using GPU technologies. For example, the Racon consensus tool executes ~2× faster than the regular baseline CPU-only jobs, while the Bonito base calling tool shows ~50× speedup.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133408309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling HPC Workflows with Intel Optane Persistent Memory","authors":"R. Venkatesh, Tony Mason, Pradeep R. Fernando, G. Eisenhauer, Ada Gavrilovska","doi":"10.1109/IPDPSW52791.2021.00017","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00017","url":null,"abstract":"HPC workloads and their Increasing data processing demands have led to using in situ execution, which couples simulation and analytics to reduce cross node memory accesses and their negative impact on overall performance. In situ executions can benefit from new classes of persistent memory technologies, such as Intel® Optane™ DC Persistent Memory (PMEM), which provide a denser, lower cost, and lower performance memory option for server class machines. However, PMEM creates a new set of trade-offs that must be considered to further improve performance for these HPC workloads and to realize the expected benefits. Prior work has only focused on describing how to tune for a single workload component, which may not yield optimal results for the entire workload.In this paper, we use a suite of workflows with different characteristics to understand the impact of using PMEM for in situ workflow executions with respect to different decisions on how PMEM is shared. Based on our experimental observations, we make recommendations for the considerations that must be incorporated for future workflow schedulers to maximize the benefits of the PMEM resource.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"46 3-4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131452741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring a Layer-based Pre-implemented Flow for Mapping CNN on FPGA","authors":"Danielle Tchuinkou Kwadjo, Joel Mandebi Mbongue, C. Bobda","doi":"10.1109/IPDPSW52791.2021.00025","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00025","url":null,"abstract":"Convolutional Neural Networks are compute-intensive learning models that have demonstrated ability and effectiveness in solving complex learning problems. However, developing a high-performance FPGA accelerator for CNN often demands high programming skills, hardware verification, precise distribution localization, and long development cycles. Besides, CNN depth increases by reuse and replication of multiple layers. This paper proposes a programming flow for CNN on FPGA to generate high-performance accelerators by assembling CNN pre-implemented components as a puzzle based on the graph topology. Using pre-implemented components allows us to use the minimum of resources necessary, predict the performance, and gain in productivity since there is no need to synthesize any HDL code. Furthermore, components can be reused for a different range of applications. Through prototyping, we demonstrated the viability and relevance of our approach. Experiments show a productivity improvement of up to 69% compared to a traditional FPGA implementation while achieving over 1.75× higher Fmax with lower resources and power consumption.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132205760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Data Parallelism Code Restructuring for HLS Targeting FPGAs","authors":"Renato Campos, João MP Cardoso","doi":"10.1109/IPDPSW52791.2021.00029","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00029","url":null,"abstract":"FPGAs have emerged as hardware accelerators, and in the last decade, researchers have proposed new languages and frameworks to improve the efficiency when mapping computations to FPGAs. One of the main tasks when considering the mapping of software code to FPGAs is code restructuring. Code restructuring is of paramount importance to achieve efficient FPGA-based accelerators, and its automation continues to be a challenge. This paper describes our recent work on techniques to automatically restructure and annotate C code with directives optimized for HLS targeting FPGAs. The input of our approach consists of an unfolded dataflow graph (DFG), currently obtained by a trace of the program’s execution, and restructured C code with HLS directives as output. Specifically, in this paper we propose algorithms to optimize the input DFGs and use isomorphic graph detection for exposing data-level parallelism. The experimental results show that our approach is able to generate efficient FPGA implementations, with significant speedups over the input unmodified source codes, and very competitive to implementations obtained by manual optimizations and by previous approaches. Furthermore, the experiments show that, using our approach, it is possible to extract data-parallelism in linear to quadratic time with respect to the number of nodes of the input DFG.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114132811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to SNACS 2021","authors":"","doi":"10.1109/ipdpsw52791.2021.00121","DOIUrl":"https://doi.org/10.1109/ipdpsw52791.2021.00121","url":null,"abstract":"","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123801567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}