{"title":"Spray: Sparse Reductions of Arrays in OPENMP","authors":"J. Hückelheim, J. Doerfert","doi":"10.1109/IPDPS49936.2021.00056","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00056","url":null,"abstract":"We present SPRAY, an open-source header-only C++ library for sparse reductions of arrays. SPRAY is meant for applications in which a large array is collaboratively updated by multiple threads using an associative and commutative operation such as +=. Especially when each thread accesses only parts of the array, SPRAY can perform significantly better than OPENMP’s built-in reduction clause or atomic updates, while also using less memory than the former. SPRAY provides both an easy-to-use interface that can serve as a drop-in replacement for OPENMP reductions and a selection of reducer objects that accumulate the final result in different thread-safe ways. We demonstrate SPRAY through multiple test cases including the LULESH shock hydrodynamics code and a transpose-matrix-vector multiplication for sparse matrices stored in CSR format. SPRAY reductions outperform built-in OPENMP reductions consistently, in some cases improving run time and memory overhead by 20X, and even beating domain-specific approaches such as Intel MKL by over 2X in some cases. Furthermore, SPRAY reductions have a minimal impact on the code base, requiring only a few lines of source code changes. Once in place, SPRAY reduction schemes can be switched easily, allowing performance portability and tuning opportunities by separating performance-critical implementation details from application code.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116906912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finer-LRU: A Scalable Page Management Scheme for HPC Manycore Architectures","authors":"Jiwoo Bang, Chungyong Kim, Sunggon Kim, Qichen Chen, Cheongjun Lee, Eun-Kyu Byun, J. Lee, Hyeonsang Eom","doi":"10.1109/IPDPS49936.2021.00065","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00065","url":null,"abstract":"In HPC systems, the increasing need for a higher level of concurrency has led to packing more cores within a single chip. However, since multiple processes share memory space, the frequent access to resources in critical sections where only atomic operation has to be executed can result in poor performance. In this paper, we focus on reducing lock contention on the memory management system of an HPC manycore architecture. One of the critical sections causing severe lock contention in the I/O path is in the page management system, which uses multiple Least Recently Used (LRU) lists with a single lock instance. To solve this problem, we propose a Finer-LRU scheme, which optimizes the page reclamation process by splitting LRU lists into multiple sub-lists, each having its own lock instance. Our evaluation result shows that the Finer-LRU scheme can improve sequential write throughput by 57.03% and reduce latency by 98.94% compared to the baseline Linux kernel version 5.2.8 in the Intel Knights Landing (KNL) architecture.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126473383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Demystifying GPU Reliability: Comparing and Combining Beam Experiments, Fault Simulation, and Profiling","authors":"F. Santos, S. Hari, P. M. Basso, L. Carro, P. Rech","doi":"10.1109/IPDPS49936.2021.00037","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00037","url":null,"abstract":"Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU’s computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than $5 times$) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134040046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QoS-Aware and Resource Efficient Microservice Deployment in Cloud-Edge Continuum","authors":"Kaihua Fu, Wei Zhang, Quan Chen, Deze Zeng, Xingru Peng, Wenli Zheng, M. Guo","doi":"10.1109/IPDPS49936.2021.00102","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00102","url":null,"abstract":"User-facing services are now evolving towards the microservice architecture where a service is built by connecting multiple microservice stages. While an entire service is heavy, the microservice architecture shows the opportunity to only offload some microservice stages to the edge devices that are close to the end users. However, emerging techniques often result in the violation of Quality-of-Service (QoS) of microservice-based services in cloud-edge continuum, as they do not consider the communication overhead or the resource contention between microservices.We propose Nautilus, a runtime system that effectively deploys microservice-based user-facing services in cloud-edge continuum. It ensures the QoS of microservice-based user-facing services while minimizing the required computational resources. Nautilus is comprised of a communication-aware microservice mapper, a contention-aware resource manager and a load-aware microservice scheduler. The mapper divides the microservice graph into multiple partitions based on the communication overhead and maps the partitions to the nodes. On each node, the resource manager determines the optimal resource allocation for its microservices based on reinforcement learning that may capture the complex contention behaviors. The microservice scheduler monitors the QoS of the entire service, and migrates microservices from busy nodes to idle ones at runtime. Our experimental results show that Nautilus reduces the computational resource usage by 23.9% and the network bandwidth usage by 53.4%, while achieving the required 99%-ile latency.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132552801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PALM: Progress- and Locality-Aware Adaptive Task Migration for Efficient Thread Packing","authors":"Jinsu Park, Seongbeom Park, Myeonggyun Han, Woongki Baek","doi":"10.1109/IPDPS49936.2021.00041","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00041","url":null,"abstract":"Thread packing (TP) is an effective and widely-used technique to significantly improve the efficiency of parallel systems by dynamically controlling the number of cores allocated to multithreaded applications based on their requirements such as performance and energy efficiency. Despite the extensive prior works on TP, little work has been done to investigate and address its performance inefficiencies that arise across various parallel systems and applications with different characteristics. To bridge this gap, we investigate the performance inefficiencies of TP using a wide range of parallel applications and system configurations and identify their root causes. Guided by the in-depth performance characterization results, we propose PALM, progress- and locality-aware adaptive task migration for efficient TP. Through quantitative evaluation, we demonstrate that PALM achieves significantly higher performance and lower energy consumption than TP across various synchronization-intensive applications and system configurations, provides the performance and energy consumption comparable with the thread reduction technique, and considerably improves the efficiency of dynamic server consolidation and the performance under power capping.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127259412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Plex: Scaling Parallel Lexing with Backtrack-Free Prescanning","authors":"Le Li, Shigeyuki Sato, Qiheng Liu, K. Taura","doi":"10.1109/IPDPS49936.2021.00079","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00079","url":null,"abstract":"Lexical analysis, which converts input text into a list of tokens, plays an important role in many applications, including compilation and data extraction from texts. To recognize token patterns, a lexer incorporates a sequential computation model – automaton as its basic building component. As such, it is considered difficult to parallelize due to the inherent data dependency. Much work has been done to accelerate lexical analysis through parallel techniques. Unfortunately, existing attempts mainly rely on language-specific remedies for input segmentation, which makes it not only tricky for language extension, but also challenging for automatic lexer generation. This paper presents Plex – an automated tool for generating parallel lexers from user-defined grammars. To overcome the inherent sequentiality, Plex applies a fast prescanning phase to collect context information prior to scanning. To reduce the overheads brought by prescanning, Plex adopts a special automaton, which is derived from that of the scanner, to avoid backtracking behavior and exploits data-parallel techniques. The evaluation under several languages shows that the prescanning overhead is small, and consequently Plex is scalable and achieves 9.8-11.5X speedups using 18 threads.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133463562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Analysis of Scientific Computing Workloads on General Purpose TEEs","authors":"Ayaz Akram, Anna Giannakou, V. Akella, Jason Lowe-Power, S. Peisert","doi":"10.1109/IPDPS49936.2021.00115","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00115","url":null,"abstract":"Scientific computing sometimes involves computation on sensitive data. Depending on the data and the execution environment, the HPC (high-performance computing) user or data provider may require confidentiality and/or integrity guarantees. To study the applicability of hardware-based trusted execution environments (TEEs) to enable secure scientific computing, we deeply analyze the performance impact of general purpose TEEs, AMD SEV, and Intel SGX, for diverse HPC benchmarks including traditional scientific computing, machine learning, graph analytics, and emerging scientific computing workloads. We observe three main findings: 1) SEV requires careful memory placement on large scale NUMA machines (1×–3.4× slowdown without and 1×–1.15× slowdown with NUMA aware placement), 2) virtualization—a prerequisite for SEV— results in performance degradation for workloads with irregular memory accesses and large working sets (1×–4× slowdown compared to native execution for graph applications) and 3) SGX is inappropriate for HPC given its limited secure memory size and inflexible programming model (1.2×–126× slowdown over unsecure execution). Finally, we discuss forthcoming new TEE designs and their potential impact on scientific computing.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131554639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized Low-Latency Task Scheduling for Ad-Hoc Computing","authors":"Janick Edinger, Mamn Breitbach, Niklas Gabrisch, Dominik Schäfer, Christian Becker, Amr Rizk","doi":"10.1109/IPDPS49936.2021.00087","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00087","url":null,"abstract":"End users can mutually share their computing resources in ad-hoc computing environments with code offloading. This augments the computational power of resource-constrained mobile devices and enables interactive user-facing applications that would otherwise exceed single device capabilities. However, ad-hoc computing comes along with new challenges such as heterogeneity and unreliability of devices. Resource consumers have to make task scheduling decisions without relying on a centralized scheduler to facilitate sub-second response times in environments with communication latencies that are in the order of the task execution times. In this paper, we present a decentralized low-latency task scheduling approach that minimizes job execution times in heterogeneous ad-hoc environments. We propose two decentralized task scheduling algorithms that select powerful computing resources for parallel task execution while avoiding delays that arise from congested devices. We provide an analytical model of the performance of these algorithms before conducting an extensive evaluation based on real-world applications and a realistic computing infrastructure. Our results show that decentralized scheduling can dynamically adapt to varying system load and outperform a central scheduler in both task and job execution times, which enables low-latency task offloading in ad-hoc environments.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123281410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Systemic Assessment of Node Failures in HPC Production Platforms","authors":"Anwesha Das, F. Mueller, B. Rountree","doi":"10.1109/IPDPS49936.2021.00035","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00035","url":null,"abstract":"Production HPC clusters endure failures reducing computational capability and resource availability. Despite the presence of various failure prediction schemes for large-scale computing systems, a comprehensive understanding of how nodes fail considering various components and layers of the system is required for sustained resilience. This work performs a holistic diagnosis of node failures using a measurement-driven approach on contemporary system logs that can help vendors and system administrators support exascale resilience.Our work shows that external environmental influence is not strongly correlated with node failures in terms of the root cause. Though hardware and software faults trigger failures, the underlying root cause often lies in the application malfunctioning causing the system to fail. Furthermore, lead time enhancements are feasible for nodes showing fail slow characteristics. This study excavates such helpful empirical observations, which could facilitate better failure handling in production systems.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123359850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures","authors":"Daniel Mlakar, Martin Winter, Mathias Parger, M. Steinberger","doi":"10.1109/IPDPS49936.2021.00080","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00080","url":null,"abstract":"Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix vector multiplication in iterative solvers. To compute a bandwidth reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU are viable options. Nevertheless, reordering matrices, potentially multiple times in-between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multicore CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa.Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU-thread/a GPU thread-block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to the ground-truth single-threaded algorithm.We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA’s single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches. Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering into sequences of sparse matrix operations without major performance loss.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124780178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}