{"title":"Revisiting Credit Distribution Algorithms for Distributed Termination Detection","authors":"G. Bosilca, Aurélien Bouteiller, T. Hérault, Valentin Le Fèvre, Y. Robert, J. Dongarra","doi":"10.1109/IPDPSW52791.2021.00095","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00095","url":null,"abstract":"This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"32 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133740434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CUDAMicroBench: Microbenchmarks to Assist CUDA Performance Programming","authors":"Xinyao Yi, D. Stokes, Yonghong Yan, C. Liao","doi":"10.1109/IPDPSW52791.2021.00068","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00068","url":null,"abstract":"Programming to achieve high performance on NVIDIA GPUs using CUDA is known to be challenging. A GPU has hundreds or thousands of cores, so a program must exhibit sufficient parallelism to achieve maximum GPU utilization. A system with GPU accelerators has a heterogeneous and deep memory system that programmers must use effectively and correctly to take full advantage of the GPU’s parallelism. In this paper, we present CUDAMicroBench, a collection of fourteen microbenchmarks that demonstrate performance challenges in CUDA programming and techniques for optimizing CUDA programs to address these challenges. It also includes examples and techniques for using advanced CUDA features, such as data shuffling between threads and dynamic parallelism, that can help users optimize CUDA programs for performance. The microbenchmarks can be used to evaluate the performance of GPU architectures, the memory systems of the GPU itself and of the whole system, and the effectiveness of compilers and performance tools for performance analysis. They can also help users understand the complexity of heterogeneous GPU-accelerator systems through examples and guide them in performance optimization. It is released as BSD-licensed open source at https://github.com/passlab/CUDAMicroBench.git.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133417027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Modeling and Tuning for DFT Calculations on Heterogeneous Architectures","authors":"H. Ahmed, David B. Williams-Young, K. Ibrahim, Chao Yang","doi":"10.1109/IPDPSW52791.2021.00108","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00108","url":null,"abstract":"Tuning scientific code for heterogeneous computing architectures is a growing challenge. Not only do we need to tune the code for multiple architectures, but we also need to select or schedule computations on the most efficient compute variant. In this paper, we explore tuning and performance modeling for one of the most time-consuming kernels in density functional theory calculations on systems with a multicore host CPU accelerated with GPUs. We show that the problem configuration dictates the choice of the most efficient compute engine. This choice can alternate between the host and the accelerator, especially while scaling. As such, a performance model that predicts the execution time on the host CPU and GPU is essential to select the compute environment and achieve optimal performance. We present a simple model that empirically carries out such tasks and can accurately steer the scheduling of computation.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130567644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DataVinci: Proactive Data Placement for Ad-Hoc Computing","authors":"Martin Breitbach, Janick Edinger, Dominik Schäfer, Christian Becker","doi":"10.1109/IPDPSW52791.2021.00129","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00129","url":null,"abstract":"Mobile ad-hoc computing enables applications to offload computationally intensive tasks to end-user devices in proximity. Many state-of-the-art applications such as face recognition, machine learning, or computer vision require large amounts of input data that is shared among multiple tasks. In these use cases, offloading the workload to remote devices becomes more time-consuming and, consequently, less attractive due to the required data transfer. As a solution, a proactive distribution of the data files on potential computational resource providers eliminates the need for ad-hoc data transfers. The characteristics of ad-hoc computing environments necessitate non-trivial data and task placement strategies. In this paper, we propose DataVinci, a data and task scheduler for mobile ad-hoc computing environments. DataVinci determines the number of copies for each data file (replicas), places these replicas proactively on remote devices, and schedules tasks based on the previously created data distribution. It continuously adjusts the number of replicas and balances the trade-off between execution latencies and data transfer overhead. In a large-scale study, we show the effectiveness of DataVinci, which reduces the average task execution time by more than 60 percent compared to an approach without proactive data placement, while keeping the amount of transferred data constant.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126050572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AsHES 2021 Keynote - Addressing Scalability Bottlenecks of DNN Training Through Hardware Heterogeneity: A View from the Perspectives of Memory Capacity and Energy Consumption","authors":"","doi":"10.1109/ipdpsw52791.2021.00073","DOIUrl":"https://doi.org/10.1109/ipdpsw52791.2021.00073","url":null,"abstract":"","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130179552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shared-Memory Scalable k-Core Maintenance on Dynamic Graphs and Hypergraphs","authors":"Kasimir Gabert, Ali Pinar, Ümit V. Çatalyürek","doi":"10.1109/IPDPSW52791.2021.00158","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00158","url":null,"abstract":"Computing k-cores on graphs is an important graph mining target as it provides an efficient means of identifying a graph’s dense and cohesive regions. Computing k-cores on hypergraphs has seen recent interest, as many datasets naturally produce hypergraphs. Maintaining k-cores as the underlying data changes is important as graphs are large, growing, and continuously modified. In many practical applications, graph updates are bursty, with periods of significant activity and periods of relative calm. Existing maintenance algorithms fail to handle large bursts, and prior parallel approaches on both graphs and hypergraphs fail to scale as available cores increase. We address these problems by presenting two parallel and scalable fully-dynamic batch algorithms for maintaining k-cores on both graphs and hypergraphs. Both algorithms take advantage of the connection between k-cores and h-indices. One algorithm is well suited for large batches and the other for small. We provide the first algorithms that experimentally demonstrate scalability as the number of threads increases while sustaining high change rates in graphs and hypergraphs.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130347138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Double Rank-based Multi-workflow Scheduling with Multi-objective Optimization in Cloud Environments","authors":"Feng Li, Moon Gi Seok, Wentong Cai","doi":"10.1109/IPDPSW52791.2021.00015","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00015","url":null,"abstract":"Workflow scheduling in clouds has been extensively researched. Many workflows from different users may be submitted to a cloud at the same time, and cloud providers must handle them simultaneously, so it is necessary to consider the multi-workflow scheduling problem. In addition, cloud computing systems offer special features such as Pay-Per-Use and Quality of Service (QoS) over the Internet. The scheduler has to consider the trade-offs between different QoS parameters in order to satisfy the QoS requirements. Hence, scheduling multiple heterogeneous workflows simultaneously while balancing multiple objectives is a big challenge. The majority of existing multi-workflow scheduling algorithms are based on QoS-constrained approaches and attempt to optimize one objective while treating other QoS factors as constraints. Meanwhile, most multi-objective optimization scheduling work deals with a single workflow. In contrast, this paper focuses on QoS optimization approaches that find trade-off schedules for executing multiple workflows on cloud computing resources so as to balance multiple objectives. To this end, a new double rank-based task sequencing method is proposed and integrated with a multi-objective heuristic algorithm for multi-workflow scheduling. Different algorithms are evaluated using various well-known real-world workflows and simulated workflows. The performance evaluation results demonstrate that the proposed approach is capable of generating efficient, high-quality schedules that meet multiple objectives for multiple workflows.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127256854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Area-Efficient SPHINCS+ Post-Quantum Signature Coprocessor","authors":"Quentin Berthet, A. Upegui, L. Gantel, Alexandre Duc, Giulia Traverso","doi":"10.1109/IPDPSW52791.2021.00034","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00034","url":null,"abstract":"Significant advances in quantum computing over the past decade leave no doubt that quantum computers are an actual threat to cryptography. For this reason, considerable effort has recently gone into designing so-called post-quantum cryptographic primitives. The adoption of these schemes depends on their future capability to offer performance and functionality similar to their classical counterparts. In particular, a milestone towards standardization is an FPGA implementation of cryptographic primitives that executes efficiently. We contribute in this respect by providing an area-efficient FPGA implementation of SPHINCS+, a post-quantum signature scheme that guarantees very high security, allowing its deployment in embedded systems such as hardware security modules, IoT devices, or nanosatellites.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129717661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}