{"title":"Message from the PDSEC-22 Workshop Chairs","authors":"Sabine Roller, P. Strazdins, R. Couturier, N. E. Pour, Suzanne Shontz, T. Rauber, G. Runger, L. Yang","doi":"10.1109/IPDPSW55747.2022.00137","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00137","url":null,"abstract":"Welcome to the 23rd IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-22), held virtually on June 3rd, 2022 in Lyon, France, in conjunction with the 36th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2022). This year, the workshop, like IPDPS, took place virtually due to the COVID-19 pandemic.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116302356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building scalable indexes that can be efficiently queried","authors":"C. Boucher","doi":"10.1109/IPDPSW55747.2022.00034","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00034","url":null,"abstract":"Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. We later showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding --- but they did not say how to find those thresholds. We present another novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared to existing methods, ours used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, our index was less than one thousandth the size of competing indexes for large collections of human chromosomes.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"362 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114769898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COMPOFF: A Compiler Cost model using Machine Learning to predict the Cost of OpenMP Offloading","authors":"Alok Mishra, Smeet Chheda, Carlos Soto, A. Malik, Meifeng Lin, Barbara M. Chapman","doi":"10.1109/IPDPSW55747.2022.00074","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00074","url":null,"abstract":"The HPC industry is inexorably moving towards an era of extremely heterogeneous architectures, with more devices configured on any given HPC platform and potentially more kinds of devices, some of them highly specialized. Writing separate code for each target system of a given HPC application is not practical. The better solution is to use directive-based parallel programming models such as OpenMP. OpenMP provides a number of options for offloading a piece of code to devices like GPUs. To select the best of these options during compilation, most modern compilers use analytical models to estimate the cost of executing the original code and the different offloading code variants. Building such an analytical model for compilers is a difficult task that requires considerable effort from compiler engineers. Recently, machine learning techniques have been successfully applied to build cost models for a variety of compiler optimization problems. In this paper, we present COMPOFF, a cost model that statically estimates the Cost of OpenMP OFFloading using a neural network model. We applied six different transformations to a parallel implementation of the Wilson Dslash operator to support GPU offloading, and we used COMPOFF to predict their execution cost on different GPUs at compile time. Our results show that this model can predict offloading costs with a root mean squared error of less than 0.5 seconds. Our preliminary findings indicate that this work will make it much easier and faster for scientists and compiler developers to port legacy HPC applications that use OpenMP to new heterogeneous computing environments.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134087049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Volume Estimation for Dynamic Environments using Deep Learning on the Edge","authors":"Chandan Kumar, Yamini Mathur, A. Jannesari","doi":"10.1109/IPDPSW55747.2022.00159","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00159","url":null,"abstract":"Edge devices are increasingly used for volume estimation of uneven terrain. Existing techniques use several geo-tagged images of the landscape, captured in flight by an edge device mounted on a UAV, to generate 3D models and perform volume estimation through manual boundary marking. These methods, although accurate, require significant time and human effort, and are heavily dependent on GPS. We present an efficient deep learning framework that detects the object of interest and automatically determines the volume (independent of GPS) of the detected object on the fly. Our method employs a stereo camera for depth sensing of the object and overlays a unit mesh grid over the object's boundary to perform volume estimation. We explore the accuracy vs. computational complexity trade-off across variations of our technique. Experiments indicate that our method reduces the time for volume estimation by several orders of magnitude compared to existing methods, and that it is also independent of GPS. Also, to the best of our knowledge, this is the first method that can perform volume analysis in a dynamic environment.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116172310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGRA4HPC 2022 Invited Speaker: Practical, scalable, and easy-to-use CGRA for HPC","authors":"Ilan Tayari","doi":"10.1109/IPDPSW55747.2022.00111","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00111","url":null,"abstract":"NextSilicon has developed technology that allows using large CGRAs for acceleration of HPC applications and workloads, with zero code changes to parallel high-level language codes. Mr. Tayari will present this innovative technology, what it means for CGRA designs, and how it fits in the HPC market.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"185 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121837280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Analysis of Multi-Containerized MD Simulations for Low-Level Resource Allocation","authors":"Shingo Okuno, Akira Hirai, Naoto Fukumoto","doi":"10.1109/IPDPSW55747.2022.00162","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00162","url":null,"abstract":"This study discusses scheduling strategies to maximize ensemble throughput, which is the total throughput of multiple containers running simultaneously. Such a strategy is useful, for example, in ensemble runs of molecular dynamics (MD) simulations. To design the strategies, we need to tackle two major challenges: (1) how many containers, and how many threads per container, we should allocate, and (2) which low-level resources we should allocate to reflect workload characteristics. The latter challenge in particular is important and unavoidable for performance-sensitive applications, because they exploit low-level hardware features such as simultaneous multi-threading (SMT) to maximize performance, whereas most container platforms do not address it. In this paper, as a preliminary experiment toward implementing SMT-aware scheduling strategies, we examined whether the ensemble throughput of MD simulations can be improved by deploying containers on separate logical cores even when they share the same physical cores. As a result, we obtained 2.22 times the ensemble throughput of a one-container execution with 10 physical cores.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"2 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125918663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MultiGrid on FPGA Using Data Parallel C++","authors":"C. Siefert, Stephen L. Olivier, G. Voskuilen, Jeffrey Young","doi":"10.1109/IPDPSW55747.2022.00147","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00147","url":null,"abstract":"Centered on modern C++ and the SYCL standard for heterogeneous programming, Data Parallel C++ (DPC++) and Intel's oneAPI software ecosystem aim to lower the barrier to entry for the use of accelerators like FPGAs in diverse applications. In this work, we consider the use of FPGAs for scientific computing, in particular with a multigrid solver, MueLu. We report on early experiences implementing kernels of the solver in DPC++ for execution on Stratix 10 FPGAs, and we evaluate several algorithmic design and implementation choices. These choices not only impact performance, but also shed light on the capabilities and limitations of DPC++ and oneAPI.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123494790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"17th IEEE International Workshop on Automatic Performance Tuning (iWAPT2022)","authors":"Che-Rung Lee, S. Ohshima","doi":"10.1109/IPDPSW55747.2022.00148","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00148","url":null,"abstract":"The goal of the Seventeenth International Workshop on Automatic Performance Tuning (iWAPT2022) is to bring together researchers who are investigating automated techniques for constructing and/or adapting algorithms and software for high performance on modern complex machine architectures. iWAPT is a series of workshops that focus on research and techniques related to performance sustainability issues. The series provides an opportunity for researchers and users of automatic performance tuning (AT) technologies to exchange ideas and experiences acquired when applying such technologies to improve the performance of algorithms, libraries, and applications, in particular on cutting-edge computing platforms. The half-day workshops consist of presentations of research papers. Topics of interest include performance modeling; adaptive algorithms; autotuned numerical algorithms; libraries and scientific applications; empirical compilation; automated code generation; frameworks and theories of AT and software optimization; autonomic computing; and context-aware computing.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123779661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On How to Push Efficient Medical Semantic Segmentation to the Edge: the SENECA approach","authors":"Raffaele Berzoini, E. D’Arnese, Davide Conficconi","doi":"10.1109/IPDPSW55747.2022.00027","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00027","url":null,"abstract":"Semantic segmentation is the process of assigning each input image pixel a value representing a class, and it enables the clustering of pixels into object instances. It is a widely used computer vision task in various fields such as autonomous driving and medical image analysis. In particular, in medical practice, semantic segmentation identifies different regions of interest within an image, like different organs or anomalies such as tumors. Fully Convolutional Networks (FCNs) have been employed to solve semantic segmentation in different fields and have found their way into the medical one. In this context, the low contrast among semantically different areas, together with constraints on energy consumption and computational resource availability, increases the complexity and limits their adoption in daily practice. Based on these considerations, we propose SENECA to bring medical semantic segmentation to the edge with high energy efficiency and low segmentation time while preserving accuracy. We reached a throughput of 335.4 ± 0.34 frames per second on the FPGA, 4.65× better than its GPU counterpart, with a global dice score of 93.04% ± 0.07 and a 12.7× improvement in energy efficiency with respect to the GPU.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123836768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Litener: An Accelerator-Enabled Lightweight Container for Edge Computing","authors":"Ryan Dyson, C. Reaño","doi":"10.1109/IPDPSW55747.2022.00158","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00158","url":null,"abstract":"Containers supporting accelerators, such as Docker, include management tasks that require a significant amount of computing resources. While these resources are available in the Cloud, in other scenarios, such as the Edge, resources are more limited. An accelerator-enabled lightweight container would be desirable in such scenarios. In this paper, we analyse platforms for containerising applications that use accelerators, including but not limited to Docker. Based on this analysis, we present a lightweight container, referred to as Litener, focused on using accelerators in scenarios with limited resources. Although the focus is on accelerators, many of the optimisations described can also be applied to scenarios that do not use accelerators. Experiments show a speedup of up to 7.78X compared to other platforms.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130274837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}