Title: Adaptive Configuration of In Situ Lossy Compression for Cosmology Simulations via Fine-Grained Rate-Quality Modeling
Authors: Sian Jin, Jesus Pulido, Pascal Grosset, Jiannan Tian, Dingwen Tao, J. Ahrens
In: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)
DOI: https://doi.org/10.1145/3431379.3460653
Abstract: Extreme-scale cosmological simulations are widely used by researchers and scientists on leadership supercomputers. A new generation of error-bounded lossy compressors is used in these workflows to reduce storage requirements and to minimize the impact of throughput limitations while saving large snapshots of high-fidelity data for post hoc analysis. In this paper, we propose adaptively providing compression configurations to the compute partitions of cosmological simulations via newly designed post-analysis-aware rate-quality modeling. The contribution is fourfold: (1) We propose a novel adaptive approach that selects feasible error bounds for different partitions, demonstrating the feasibility and efficiency of configuring lossy compression for each partition individually. (2) We build models that, from the properties of each partition, estimate both the overall post-analysis quality loss due to lossy compression and the compression ratio. (3) We develop an efficient optimization guideline for determining the best-fit combination of error bounds that maximizes the compression ratio under an acceptable post-analysis quality loss. (4) Our approach introduces negligible overhead for feature extraction and error-bound optimization per partition, enabling post-analysis-aware in situ lossy compression for cosmological simulations. Experiments show that our models are highly accurate and reliable, and that the fine-grained adaptive configuration approach improves the compression ratio by up to 73% on the tested datasets at the same post-analysis distortion, with only 1% performance overhead.
Title: TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes
Authors: Carl Pearson, Kun Wu, I. Chung, Jinjun Xiong, Wen-mei W. Hwu
In: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)
DOI: https://doi.org/10.1145/3431379.3460645
Abstract: MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations have encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to operate directly on GPU buffers, easing the integration of GPU compute into MPI codes. This work first presents a novel datatype-handling strategy for nested strided datatypes, which finds a middle ground between the specialized and generic handling of prior work. It also shows that the performance characteristics of non-contiguous data handling can be modeled with empirical system measurements and used to transparently improve MPI_Send/Recv latency. Finally, despite substantial attention to non-contiguous GPU data and CUDA-aware MPI implementations, good performance cannot be taken for granted. This work demonstrates its contributions through an MPI interposer library, TEMPI, which can be used with existing MPI deployments without system or application changes. Ultimately, the interposed-library model demonstrates MPI_Pack speedups of up to 242,000x and MPI_Send speedups of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer, yielding a speedup of more than 917x in a 3D halo exchange with 3,072 processes.
Title: Apollo: An ML-assisted Real-Time Storage Resource Observer
Authors: N. Rajesh, H. Devarajan, Jaime Cernuda Garcia, Keith Bateman, Luke Logan, Jie Ye, Anthony Kougkas, Xian-He Sun
In: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)
DOI: https://doi.org/10.1145/3431379.3460640
Abstract: Applications and middleware services, such as data-placement engines, I/O scheduling, and prefetching engines, require low-latency access to telemetry data in order to make optimal decisions. However, typical monitoring services store their telemetry data in a database so that applications can query it, incurring significant latency penalties. This work presents Apollo, a low-latency monitoring service that provides applications and middleware libraries with direct access to relational telemetry data. Monitoring the system can create interference and overhead, slowing down the raw performance of the resources available to a job; however, a current view of the system can help middleware services make better decisions that ultimately improve overall performance. Apollo has been designed from the ground up for low latency, using publish-subscribe (pub-sub) semantics, and for low overhead, using adaptive intervals that adjust how often each resource is polled for telemetry data and machine learning to predict changes to the telemetry data between polls. This work also provides high-level abstractions called I/O curators, which further help middleware libraries and applications make optimal decisions. Evaluations show that Apollo achieves sub-millisecond latency for acquiring complex insights, with a memory overhead of ~57 MB and a CPU overhead only 7% higher than existing state-of-the-art systems.
Title: An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
Authors: A. Kahira, Truong Thao Nguyen, L. Bautista-Gomez, Ryousei Takano, R. Badia, M. Wahib
In: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)
DOI: https://doi.org/10.1145/3431379.3460644
Abstract: Deep neural network (DNN) frameworks use distributed training to enable faster time to convergence and to alleviate memory-capacity limitations when training large models and/or using high-dimensional inputs. With the steady increase in dataset and model sizes, model/hybrid parallelism is expected to play an important role in the future of distributed DNN training. We analyze the compute, communication, and memory requirements of convolutional neural networks (CNNs) to understand the trade-offs between different parallelism approaches in performance and scalability. We use this model-driven analysis as the basis for an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1,024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, reaching as high as 97.57% for data parallelism.
Title: ARC: An Automated Approach to Resiliency for Lossy Compressed Data via Error Correcting Codes
Authors: Dakota Fulp, Alexandra Poulos, R. Underwood, John C. Calhoun
In: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)
{"title":"Cache-aware Sparse Patterns for the Factorized Sparse Approximate Inverse Preconditioner","authors":"Sergi Laut, R. Borrell, Marc Casas","doi":"10.1145/3431379.3460642","DOIUrl":"https://doi.org/10.1145/3431379.3460642","url":null,"abstract":"Conjugate Gradient is a widely used iterative method to solve linear systems Ax=b with matrix A being symmetric and positive definite. Part of its effectiveness relies on finding a suitable preconditioner that accelerates its convergence. Factorized Sparse Approximate Inverse (FSAI) preconditioners are a prominent and easily parallelizable option. An essential element of a FSAI preconditioner is the definition of its sparse pattern, which constraints the approximation of the inverse A-1. This definition is generally based on numerical criteria. In this paper we introduce complementary architecture-aware criteria to increase the numerical effectiveness of the preconditioner without incurring in significant performance costs. In particular, we define cache-aware pattern extensions that do not trigger additional cache misses when accessing vector x in the y=Ax Sparse Matrix-Vector (SpMV) kernel. As a result, we obtain very significant reductions in terms of average solution time ranging between 12.94% and 22.85% on three different architectures - Intel Skylake, POWER9 and A64FX - over a set of 72 test matrices.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125441144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPI-CorrBench: Towards an MPI Correctness Benchmark Suite","authors":"Jan-Patrick Lehr, Tim Jammer, C. Bischof","doi":"10.1145/3431379.3460652","DOIUrl":"https://doi.org/10.1145/3431379.3460652","url":null,"abstract":"The Message Passing Interface (MPI) is the de-facto standard for distributed memory computing in high-performance computing (HPC). To aid developers write correct MPI programs, different tools have been proposed, e.g., Intel Trace Analyzer and Collector (ITAC), MUST, Parcoach and MPI-Checker. Unfortunately, the effectiveness of these tools is hard to compare, as they have not been evaluated on a common set of applications. More importantly, well-known and widespread benchmarks, which tend to be well-tested and error free, were used for their evaluation. To enable a structured comparison and improve the coverage and reliability of available MPI correctness tools, we propose MPI-CorrBench as a common test harness. MPI-CorrBench enables a structured comparison of the different tools available w.r.t. various types of errors. In our evaluation, we use MPI-CorrBench to provide a well-defined set of error-cases to MUST, ITAC, Parcoach and MPI-Checker. In particular, we find that ITAC and MUST complement each other in many cases. In general, MUST works better for detecting type errors while ITAC is better in detecting errors in non-blocking operations. Although the most-used functions of MPI are well supported, MPI-CorrBench shows that for one sided communication, the error detection capability of all evaluated tools needs improvement. Moreover, our experiments reveal a MPI standard violation in the MPICH test suite as well as several cases of discouraged use of MPI functionality.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131503647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription
Authors: Hang Huang, J. Rao, Song Wu, Hai Jin, Hong Jiang, Hao Che, Xiaofeng Wu
In: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)
DOI: https://doi.org/10.1145/3431379.3460641
Abstract: Elasticity is an essential feature of cloud computing that allows users to dynamically add or remove resources in response to workload changes. However, building applications that truly exploit elasticity is non-trivial: traditional applications must be modified to efficiently utilize variable resources. This paper explores thread oversubscription, i.e., provisioning more threads than the available cores, to exploit CPU elasticity in the cloud. While maintaining sufficient concurrency allows applications to utilize additional CPUs when more are made available, it is widely believed that thread oversubscription introduces prohibitive overheads due to excessive context switches, loss of locality, and contention on shared resources. In this paper, we conduct a comprehensive study of the overhead of thread oversubscription. We find that (1) the direct cost of context switching (1-2 μs on modern processors) does not cause a noticeable slowdown in most applications, and (2) oversubscription can be both constructive and destructive to the performance of CPU caches and the TLB. We identify two previously under-studied issues responsible for drastic slowdowns in many applications under oversubscription. First, the existing thread sleep and wakeup process in the OS kernel is inefficient at handling oversubscribed threads. Second, pervasive busy-waiting operations in program code can waste CPU and starve critical threads. To this end, we devise two OS mechanisms, virtual blocking and busy-waiting detection, that enable efficient thread oversubscription without requiring program-code changes. Experimental results show that our approaches achieve an efficiency close to that of under-subscribed scenarios while preserving the capability to expand to many more CPUs. The performance gain is up to 77% for blocking-based and 19x for busy-waiting-based applications compared to vanilla Linux.
Title: DRLPart: A Deep Reinforcement Learning Framework for Optimally Efficient and Robust Resource Partitioning on Commodity Servers
Authors: Ruobing Chen, Jinping Wu, Haosen Shi, Yusen Li, Xiaoguang Liu, Gang Wang
In: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '21)
DOI: https://doi.org/10.1145/3431379.3460648
Abstract: Workload consolidation is a commonly used approach for improving the resource utilization of commodity servers. However, colocated workloads often suffer significant performance degradation due to resource contention, which makes resource partitioning an important research problem. Partitioning multiple resources coordinately is particularly challenging due to complex contention behaviors and a huge solution space, and it is not well addressed in the literature. In this paper, we propose a deep reinforcement learning (DRL) framework, named DRLPart, for partitioning multiple resources coordinately. DRLPart learns the optimal partitioning decision from easy-to-collect real-time system state, without the need for domain knowledge or handcrafted search heuristics. We solve two critical challenges of applying DRL to the resource-partitioning problem. First, we build a deep-learning-based performance model that significantly reduces training overhead by estimating the rewards of actions without interacting with the real system. Second, we propose a fine-tuning process that corrects the occasional bad decisions of the DRL model, enhancing its adaptivity to new situations. Results from extensive evaluations show that the proposed framework is efficient and robust, improving system throughput by 13.3% to 18.5% compared to state-of-the-art baselines.
{"title":"Productive Programming of Distributed Systems with the SHAD C++ Library","authors":"Vito Giovanni Castellana, Marco Minutoli","doi":"10.1145/3431379.3462765","DOIUrl":"https://doi.org/10.1145/3431379.3462765","url":null,"abstract":"High-performance computing (HPC) is often perceived as a matter of making large-scale systems (e.g., clusters) run as fast as possible, regardless the required programming effort. However, the idea of \"bringing HPC to the masses\" has recently emerged. Inspired by this vision, we have designed SHAD, the Scalable High-performance Algorithms and Data-structures library [1][6]. SHAD is open source software, written in C++, for C++ developers. Unlike other HPC libraries for distributed systems, which rely on SPMD models, SHAD adopts a shared-memory programming abstraction, to make C++ programmers feel at home. Underneath, SHAD manages tasking and data-movements, moving the computation where data resides and taking advantage of asynchrony to tolerate network latency. At the bottom of his stack, SHAD can interface with multiple runtime systems: this not only improves developer's productivity, by hiding the complexity of such software and of the underlying hardware, but also greatly enhance code portability. Thanks to its abstraction layers, SHAD can indeed target different systems, ranging from laptops to HPC clusters, without any need for modifying the user-level code.We have prototyped and open-sourced the implementation of (a subset of) the C++ standard library (STL) targeting multi-node HPC clusters. Our work allows plain STL-based C++ code to scale on HPC systems, with no need for rewriting the code to exploit the complex hardware. SHAD is available under Apache v2 License at https://github.com/pnnl/SHAD. In this paper we overview the design of the SHAD library, depicting its main components: runtime systems abstractions for tasking; parallel and distributed data-structures; STL-compliant interfaces and algorithms.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114438393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}