{"title":"Evaluating CUDA Portability with HIPCL and DPCT","authors":"Zheming Jin, J. Vetter","doi":"10.1109/IPDPSW52791.2021.00065","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00065","url":null,"abstract":"HIPCL is expanding the scope of the CUDA portability route from an AMD platform to an OpenCL platform. In the meantime, the Intel DPC++ Compatibility Tool (DPCT) is migrating a CUDA program to a data parallel C++ (DPC++) program. Towards the goal of portability enhancement, we evaluate the performance of the CUDA applications from Rodinia, SHOC, and proxy applications ported using HIPCL and DPCT on Intel GPUs. After profiling the ported programs, we aim to understand their performance gaps, and optimize codes converted by DPCT to improve their performance. The open-source repository for the CUDA, HIP, and DPCT programs will be useful for the development of a translator.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125227371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Framework for the Automatic Generation of FPGA-based Near-Data Processing Accelerators in Smart Storage Systems","authors":"Lukas Weber, Lukas Sommer, Leonardo Solis-Vasquez, Tobias Vinçon, Christian Knödler, Arthur Bernhardt, Ilia Petrov, Andreas Koch","doi":"10.1109/IPDPSW52791.2021.00028","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00028","url":null,"abstract":"Near-Data Processing is a promising approach to overcome the limitations of slow I/O interfaces in the quest to analyze the ever-growing amount of data stored in database systems. Next to CPUs, FPGAs will play an important role for the realization of functional units operating close to data stored in non-volatile memories such as Flash.It is essential that the NDP-device understands formats and layouts of the persistent data, to perform operations in-situ. To this end, carefully optimized format parsers and layout accessors are needed. However, designing such FPGA-based Near-Data Processing accelerators requires significant effort and expertise. To make FPGA-based Near-Data Processing accessible to non-FPGA experts, we will present a framework for the automatic generation of FPGA-based accelerators capable of data filtering and transformation for key-value stores based on simple data-format specifications.The evaluation shows that our framework is able to generate accelerators that are almost identical in performance compared to the manually optimized designs of prior work, while requiring little to no FPGA-specific knowledge and additionally providing improved flexibility and more powerful functionality.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"105 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127407605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences","authors":"Quentin G. Anthony, Lang Xu, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW52791.2021.00143","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00143","url":null,"abstract":"Deep Learning (DL) models for super-resolution (DLSR) are an emerging trend in response to the growth of ML/DL applications requiring high-resolution images. DLSR methods have also shown promise in domains such as medical imaging, surveillance, and microscopy. However, DLSR models are extremely computationally demanding, and require unreasonably long training times on modern Volta GPUs. In our experiments, we observed only 10.3 images/second on a single Volta GPU for training EDSR, a state-of-the-art DLSR model for single-image super-resolution. In comparison, a Volta GPU can process 360 images/second while training ResNet-50, a state-of-the-art model for image classification. Therefore, we believe supercomputers provide a good candidate to speed up DLSR model training. In this paper, we select EDSR as the representative DLSR PyTorch model. Further, we introduce Horovod-based distributed EDSR training. However, we observed poor default EDSR scaling performance on the Lassen HPC system at Lawrence Livermore National Laboratory. To investigate the performance degradations, we perform exhaustive communication profiling. These profiling insights are then used to optimize CUDA-Aware MPI for DLSR models by ensuring advanced MPI designs involving CUDA IPC and registration caching are properly applied by DL frameworks. We present a comprehensive scaling study of EDSR with MVAPICH2-GDR and NCCL up to 512 GPUs on Lassen. We demonstrate an improvement in scaling efficiency by 15.6% over default Horovod training, which translates to a 1.26× speedup in training performance.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130221115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to AsHES 2021","authors":"","doi":"10.1109/ipdpsw52791.2021.00072","DOIUrl":"https://doi.org/10.1109/ipdpsw52791.2021.00072","url":null,"abstract":"","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128893257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the GrAPL 2021 Workshop Chairs","authors":"","doi":"10.1109/ipdpsw52791.2021.00043","DOIUrl":"https://doi.org/10.1109/ipdpsw52791.2021.00043","url":null,"abstract":"","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125386806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dovado: An Open-Source Design Space Exploration Framework","authors":"D. Paletti, Davide Conficconi, M. Santambrogio","doi":"10.1109/IPDPSW52791.2021.00027","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00027","url":null,"abstract":"Traditional hardware development exploits description languages such as VHDL and (System)Verilog to produce highly parametrizable RTL designs. Different parameter values yield different utilization-frequency trade-offs, and hand-tuning is not feasible with a non-trivial amount of parameters. Generally, the Computer-Aided Design (CAD) literature proposes approaches that mainly tackle automatic exploration without combining a design automation feature. Hence, this work proposes Dovado, an open-source CAD tool for design space exploration (DSE) tailored for FPGAs-based designs. Starting from VHDL/(System)Verilog, Dovado exploits Vivado and supports the hardware developer for an exact exploration of a given set of parameters or a DSE where it returns the non-dominated set of configuration points. In this work, we exploit a multi-objective integer formulation and Non-Dominated Sorting Genetic Algorithm (NSGA)-II for a fast DSE. Moreover, we propose an approximation model for the NSGA-II fitness function to decide whether Vivado or a Nadaraya-Watson model should estimate the optimization metrics.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121920151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Intensive Computing Modules for Teaching Parallel and Distributed Computing","authors":"M. Gowanlock, Benoît Gallet","doi":"10.1109/IPDPSW52791.2021.00062","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00062","url":null,"abstract":"Parallel and distributed computing (PDC) has found a broad audience that exceeds the traditional fields of computer science. This is largely due to the increasing computational demands of many engineering and domain science research objectives. Thus, there is a demonstrated need to train students with and without computer science backgrounds in core PDC concepts. Given the rise of data science and other data-enabled computational fields, we propose several data-intensive pedagogic modules that are used to teach PDC using message-passing programming with the Message Passing Interface (MPI). These modules employ activities that are common in database systems and scientific workflows that are likely to be employed by domain scientists. Our hypothesis is that using application-driven pedagogic materials facilitates student learning by providing the context needed to fully appreciate the goals of the activities.We evaluated the efficacy of using the data-intensive pedagogic modules to teach core PDC concepts using a sample of graduate students enrolled in a high performance computing course at Northern Arizona University. In the sample, only 30% of students have a traditional computer science background. We found that the hands-on application-driven approach was generally successful at helping students learn core PDC concepts.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124336263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PIGO: A Parallel Graph Input/Output Library","authors":"Kasimir Gabert, Ümit V. Çatalyürek","doi":"10.1109/IPDPSW52791.2021.00050","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00050","url":null,"abstract":"Graph and sparse matrix systems are highly tuned, able to run complex graph analytics in fractions of seconds on billion-edge graphs. For both developers and researchers, the focus has been on computational kernels and not end-to-end runtime. Despite the significant improvements that modern hardware and operating systems have made towards input and output, these can still become application bottlenecks. Unfortunately, on high-performance shared-memory graph systems running billion-scale graphs, reading the graph from file systems easily takes over 2000× longer than running the computational kernel. This slowdown causes both a disconnect for end users and a loss of productivity for researchers and developers.We close the gap by providing a simple to use, small, header-only, and dependency-free C++11 library that brings I/O improvements to graph and matrix systems. Using our library, we improve the end-to-end performance for state-of-the-art systems significantly—in many cases by over 40×.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117229336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}