Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: Latest Publications

Understanding a program's resiliency through error propagation
Zhimin Li, Harshitha Menon, K. Mohror, P. Bremer, Yarden Livant, Valerio Pascucci
DOI: https://doi.org/10.1145/3437801.3441589 · Published: 2021-02-17

Abstract: Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), a serious problem in which a fault introduces an error into an HPC simulation that is not readily detected. Due to the insidious nature of SDCs, researchers have worked to understand their impact on applications. Previous studies have relied on expensive fault injection campaigns with uniform sampling to provide overall SDC rates, but this solution does not provide any feedback on the code regions without samples. In this research, we develop a method to systematically analyze all fault injection sites in an application with a low number of fault injection experiments. We use fault propagation data from a fault injection experiment to predict the resiliency of other untested fault sites and obtain an approximate fault tolerance threshold value for each site, which represents the largest error that can be introduced at the site without incurring incorrect simulation results. We define the collection of threshold values over all fault sites in the program as a fault tolerance boundary and propose a simple but efficient method to approximate the boundary. In our experiments, we show that our method reduces the number of fault injection samples required to understand a program's resiliency by several orders of magnitude compared with a traditional fault injection study.
Citations: 11
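The abstract gives no pseudocode, so the sketch below is only a loose illustration of the per-site threshold idea: bisect on the injected error magnitude to approximate the largest error a simulation tolerates at one site. The function names and the toy oracle are invented here; the paper's actual contribution is predicting thresholds for untested sites from propagation data rather than searching every site directly.

```cpp
#include <functional>
#include <iostream>

// Hypothetical sketch: estimate a fault tolerance threshold for one
// injection site by bisecting on the error magnitude. "runsCorrectlyWith"
// is an assumed stand-in for re-running the simulation with an error of
// the given magnitude injected at the site and checking the final result.
double approximateThreshold(const std::function<bool(double)>& runsCorrectlyWith,
                            double maxError, int steps = 20) {
    double lo = 0.0, hi = maxError;  // invariant: lo is tolerable, hi is not (or unknown)
    for (int i = 0; i < steps; ++i) {
        double mid = 0.5 * (lo + hi);
        if (runsCorrectlyWith(mid))
            lo = mid;                // an error of size mid is still benign
        else
            hi = mid;                // mid already corrupts the output
    }
    return lo;                       // largest error observed to be tolerable
}

int main() {
    // Toy "simulation": pretend any injected error below 0.125 is masked.
    auto toySite = [](double err) { return err < 0.125; };
    std::cout << approximateThreshold(toySite, 1.0) << "\n";  // prints ~0.125
}
```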
Efficiently reclaiming memory in concurrent search data structures while bounding wasted memory
D. Solomon, Adam Morrison
DOI: https://doi.org/10.1145/3437801.3441582 · Published: 2021-02-17

Abstract: Nonblocking data structures face a safe memory reclamation (SMR) problem. In these algorithms, a node removed from the data structure cannot be reclaimed (freed) immediately, as other threads may be about to access it. The goal of an SMR scheme is to minimize the number of removed nodes that cannot be reclaimed (called wasted memory) while imposing low run-time overhead. It is also desirable for an SMR scheme to be self-contained and not require specific OS features. No existing self-contained SMR scheme can guarantee a predetermined bound on wasted memory without imposing significant run-time overhead. In this paper, we introduce margin pointers (MP), the first nonblocking, self-contained SMR scheme featuring both predetermined bounded wasted memory and low run-time overhead. MP targets search data structures, such as binary trees and skip lists, which are important SMR clients and also victims of SMR's high overhead. MP's novelty lies in protecting logical subsets of the data structure from being reclaimed, as opposed to previous work, which protects physical locations (explicit nodes).
Citations: 3
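The abstract describes margin pointers only at the level of protecting logical subsets rather than physical nodes. The following is a hypothetical sketch of that general idea, not the paper's algorithm: readers publish the key range they are traversing, and a removed node is reclaimed only when no published range covers its key. A production scheme needs far more care (ordering between publication and removal, validation, and so on) than this shows.

```cpp
#include <atomic>
#include <climits>

// Loose illustration of range-based (logical) protection: instead of
// publishing hazard pointers to individual nodes, each reader publishes
// the key range it is traversing.
struct RangeGuard {
    std::atomic<long> lo{LONG_MAX};  // inclusive lower bound of protected keys
    std::atomic<long> hi{LONG_MIN};  // inclusive upper bound (empty by default)
};

constexpr int kMaxThreads = 64;
RangeGuard guards[kMaxThreads];

void protectRange(int tid, long lo, long hi) {
    guards[tid].lo.store(lo, std::memory_order_release);
    guards[tid].hi.store(hi, std::memory_order_release);
}

void unprotect(int tid) {
    guards[tid].lo.store(LONG_MAX, std::memory_order_release);
    guards[tid].hi.store(LONG_MIN, std::memory_order_release);
}

// A removed node with this key may be reclaimed only if no reader's
// published range still covers it.
bool safeToReclaim(long key) {
    for (const auto& g : guards)
        if (g.lo.load(std::memory_order_acquire) <= key &&
            key <= g.hi.load(std::memory_order_acquire))
            return false;
    return true;
}
```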
BiPart
Sepideh Maleki, U. Agarwal, Martin Burtscher, K. Pingali
{"title":"BiPart","authors":"Sepideh Maleki, U. Agarwal, Martin Burtscher, K. Pingali","doi":"10.1145/3437801.3441611","DOIUrl":"https://doi.org/10.1145/3437801.3441611","url":null,"abstract":"Hypergraph partitioning is used in many problem domains including VLSI design, linear algebra, Boolean satisfiability, and data mining. Most versions of this problem are NP-complete or NP-hard, so practical hypergraph partitioners generate approximate partitioning solutions for all but the smallest inputs. One way to speed up hypergraph partitioners is to exploit parallelism. However, existing parallel hypergraph partitioners are not deterministic, which is considered unacceptable in domains like VLSI design where the same partitions must be produced every time a given hypergraph is partitioned. In this paper, we describe BiPart, the first deterministic, parallel hypergraph partitioner. Experimental results show that BiPart outperforms state-of-the-art hypergraph partitioners in runtime and partition quality while generating partitions deterministically.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124949488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
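BiPart's algorithm is not spelled out in the abstract. As a generic illustration of how a parallel graph step can be made schedule-independent, the sketch below performs matching (a typical coarsening step in multilevel partitioners) with deterministic lowest-ID proposals, so the result depends only on the input graph and never on thread interleaving.

```cpp
#include <algorithm>
#include <execution>  // C++17 parallel algorithms (link with TBB under libstdc++)
#include <numeric>
#include <vector>

// Each vertex deterministically proposes its lowest-numbered neighbour,
// and an edge is matched only when the proposals are mutual. The outcome
// is schedule-independent because each thread writes only its own
// vertex's entries and the two phases are strictly separated.
std::vector<int> deterministicMatching(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> proposal(n, -1), mate(n, -1);
    std::vector<int> ids(n);
    std::iota(ids.begin(), ids.end(), 0);

    std::for_each(std::execution::par, ids.begin(), ids.end(), [&](int v) {
        if (!adj[v].empty())
            proposal[v] = *std::min_element(adj[v].begin(), adj[v].end());
    });
    std::for_each(std::execution::par, ids.begin(), ids.end(), [&](int v) {
        int u = proposal[v];
        if (u >= 0 && proposal[u] == v)  // mutual proposal => deterministic match
            mate[v] = u;
    });
    return mate;
}
```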
Are dynamic memory managers on GPUs slow?: a survey and benchmarks
Martin Winter, Mathias Parger, Daniel Mlakar, M. Steinberger
DOI: https://doi.org/10.1145/3437801.3441612 · Published: 2021-02-17

Abstract: Dynamic memory management on GPUs is generally understood to be a challenging topic. On current GPUs, hundreds of thousands of threads might concurrently allocate new memory or free previously allocated memory. This leads to problems with thread contention, synchronization overhead and fragmentation. Various approaches have been proposed in the last ten years, and we set out to evaluate them on a level playing field on modern hardware to answer the question of whether dynamic memory managers are as slow as commonly thought. In this survey paper, we provide a consistent framework to evaluate all publicly available memory managers in a large set of scenarios. We summarize each approach and thoroughly evaluate allocation performance (thread-based as well as warp-based), and look at performance scaling, fragmentation and real-world performance considering a synthetic workload as well as updating dynamic graphs. We discuss the strengths and weaknesses of each approach and provide guidelines for the respective best usage scenario. We provide a unified interface to integrate any of the tested memory managers into an application and switch between them for benchmarking purposes. Given our results, we can dispel some of the dread associated with dynamic memory managers on the GPU.
Citations: 11
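The unified benchmarking interface mentioned in the abstract is not specified there; below is a minimal sketch of what such an interface could look like, with the class and method names invented here. Each surveyed manager would sit behind the same two calls, so the benchmark harness is written once.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical unified interface for interchangeable memory managers;
// the paper's actual interface and manager names may differ.
class MemoryManager {
public:
    virtual ~MemoryManager() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* ptr) = 0;
};

// Baseline adapter: the system allocator. A CUDA build would instead
// wrap device-side malloc/free or one of the surveyed GPU allocators.
class SystemAllocator final : public MemoryManager {
public:
    void* allocate(std::size_t bytes) override { return std::malloc(bytes); }
    void deallocate(void* ptr) override { std::free(ptr); }
};

// Benchmark loop written once against the interface; swapping managers
// requires no changes here.
void churn(MemoryManager& mm, int iterations, std::size_t bytes) {
    for (int i = 0; i < iterations; ++i) {
        void* p = mm.allocate(bytes);
        mm.deallocate(p);
    }
}
```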
Efficiently running SpMV on long vector architectures
Constantino Gómez, F. Mantovani, E. Focht, Marc Casas
DOI: https://doi.org/10.1145/3437801.3441592 · Published: 2021-02-17

Abstract: Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization and frequently lead to suboptimal performance when SpMV runs on long-vector ISAs exploiting SIMD parallelism. In this context, the development of new optimizations becomes fundamental to enable high-performance SpMV executions on emerging long vector architectures. In this paper, we improve the state-of-the-art SELL-C-σ sparse matrix format by proposing several new optimizations for SpMV. We target aggressive long vector architectures like the NEC Vector Engine. By combining several optimizations, we obtain an average 12% improvement over SELL-C-σ considering a heterogeneous set of 24 matrices. Our optimizations boost performance on long vector architectures since they expose a high degree of SIMD parallelism.
Citations: 20
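For readers unfamiliar with SELL-C-σ (Kreutzer et al.), the serial sketch below shows the layout's key property: values are stored column-major within chunks of C rows, so the loop over rows is the unit-stride, vectorizable one. Sorting rows by length inside windows of σ is a preprocessing step not shown here, and the paper's NEC-targeted kernels go well beyond this baseline.

```cpp
#include <cstddef>
#include <vector>

// Minimal serial SpMV over a SELL-C-sigma layout. Rows are packed into
// chunks of C rows, each chunk is padded to its longest row, and
// values/column indices are stored column-major inside a chunk so that
// consecutive vector lanes touch consecutive rows.
struct SellCSigma {
    int C;                                 // chunk height (vector length)
    int nRows;
    std::vector<std::size_t> chunkOffset;  // start of each chunk in val/col
    std::vector<int> chunkWidth;           // padded row length per chunk
    std::vector<double> val;               // padded entries, zeros as filler
    std::vector<int> col;                  // padding may reuse any valid column
};

// y must be sized nRows and zero-initialized by the caller.
void spmv(const SellCSigma& A, const std::vector<double>& x, std::vector<double>& y) {
    const int nChunks = static_cast<int>(A.chunkWidth.size());
    for (int c = 0; c < nChunks; ++c) {
        for (int j = 0; j < A.chunkWidth[c]; ++j) {
            // Inner loop over r is the vectorizable one: unit stride in
            // val/col, and all C rows of the chunk advance in lockstep.
            for (int r = 0; r < A.C; ++r) {
                int row = c * A.C + r;
                if (row >= A.nRows) break;  // last chunk may be partial
                std::size_t k = A.chunkOffset[c] + static_cast<std::size_t>(j) * A.C + r;
                y[row] += A.val[k] * x[A.col[k]];
            }
        }
    }
}
```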
A lock-free relaxed concurrent queue for fast work distribution
Giorgos Kappes, S. Anastasiadis
DOI: https://doi.org/10.1145/3437801.3441583 · Published: 2021-02-17

Abstract: Modern systems require producer-consumer communication over shared memory to have low latency and high throughput. In order to achieve fast communication at high concurrency, we define a relaxed ordering model that splits the queue operations into two stages: the sequential assignment to queue slots and their subsequent concurrent execution. Based on this model, we design and implement the linearizable and lock-free algorithm called Relaxed Concurrent Queue Single (RCQS). We show experimentally that RCQS achieves an advantage of factors to orders of magnitude over state-of-the-art queue algorithms in operation latency and item transfer speed.
Citations: 5
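The two-stage model can be illustrated with a toy bounded ring. This is not the actual RCQS algorithm, which is lock-free, unbounded, and handles the empty/full races this sketch elides; it only shows the split the abstract describes: a fetch-and-add hands out slot positions in sequence (stage one), after which each thread completes its operation on its private slot concurrently (stage two).

```cpp
#include <atomic>
#include <cstddef>

// Toy two-stage ring buffer: slot assignment is sequential via
// fetch_add; slot execution proceeds concurrently and independently.
template <typename T, std::size_t N>
class TwoStageQueue {
    struct Slot {
        std::atomic<bool> full{false};
        T item{};
    };
    Slot slots[N];
    std::atomic<std::size_t> head{0}, tail{0};

public:
    void enqueue(const T& v) {
        std::size_t pos = tail.fetch_add(1);  // stage 1: claim a slot position
        Slot& s = slots[pos % N];
        s.item = v;                            // stage 2: execute on the slot
        s.full.store(true, std::memory_order_release);
    }
    T dequeue() {
        std::size_t pos = head.fetch_add(1);  // stage 1: claim a slot position
        Slot& s = slots[pos % N];
        while (!s.full.load(std::memory_order_acquire)) { /* spin (toy only) */ }
        T v = s.item;                          // stage 2: execute on the slot
        s.full.store(false, std::memory_order_release);
        return v;
    }
};
```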
Extending MapReduce framework with locality keys
Yifeng Chen, Bei Wang, Xiaolin Wang
DOI: https://doi.org/10.1145/3437801.3441607 · Published: 2021-02-17

Abstract: This paper extends the existing MapReduce framework to allow the user programmer to control data locality and reduce the communication costs of shuffle operations in iterative in-memory computation. The programming extension is fully consistent with the MapReduce style and allows a straightforward, fast implementation.
Citations: 1
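The abstract does not show the extended API, so the sketch below invents a plausible shape for it: records carry a separate locality key alongside the ordinary shuffle key, and the partitioner routes by the locality key alone, so records sharing a locality key stay on the same worker across iterations and need not be re-shuffled. All names here are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical record shape for a locality-key extension.
struct Record {
    std::string localityKey;  // controls placement across workers
    std::string key;          // ordinary MapReduce shuffle key
    std::string value;
};

// Placement depends only on the locality key, not the shuffle key, so
// iterative jobs keep co-located data in place between rounds.
std::size_t partitionOf(const Record& r, std::size_t numWorkers) {
    return std::hash<std::string>{}(r.localityKey) % numWorkers;
}
```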
Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory
Jiawen Liu, Jie Ren, R. Gioiosa, Dong Li, Jiajia Li
DOI: https://doi.org/10.1145/3437801.3441581 · Published: 2021-02-17

Abstract: Sparse tensor contractions appear commonly in many applications. Efficiently computing the product of two sparse tensors is challenging: it not only inherits the challenges of common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory access and unknown output size before computation, but also raises new challenges because of the high dimensionality of tensors, expensive multi-dimensional index search, and massive intermediate and output data. To address the above challenges, we introduce three optimization techniques that use a multi-dimensional, efficient hashtable representation for the accumulator and the larger input tensor, along with all-stage parallelization. Evaluating with 15 datasets, we show that Sparta brings 28--576× speedup over the traditional sparse tensor contraction with a sparse accumulator. With our proposed algorithm- and memory heterogeneity-aware data management, Sparta brings extra performance improvement on heterogeneous memory with DRAM and Intel Optane DC Persistent Memory Module (PMM) over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 30.7% (up to 98.5%), 10.7% (up to 28.3%) and 17% (up to 65.1%), respectively.
Citations: 17
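As a minimal sketch of the hash-based pattern, shown on matrices with a single contracted mode for brevity (Sparta handles arbitrary tensor modes with multi-dimensional tables), the larger operand is indexed by its contracted coordinate and a hashtable serves as the sparse output accumulator of unknown final size:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct NZx { int i, k; double v; };  // nonzero of X(i,k)
struct NZy { int k, j; double v; };  // nonzero of Y(k,j)

// Z(i,j) = sum_k X(i,k) * Y(k,j), all operands sparse.
std::unordered_map<std::int64_t, double>
contract(const std::vector<NZx>& X, const std::vector<NZy>& Y, int nCols) {
    // Index the second operand by its contracted mode k.
    std::unordered_map<int, std::vector<NZy>> yByK;
    for (const auto& y : Y) yByK[y.k].push_back(y);

    // Hashtable accumulator keyed by the flattened free indices (i,j);
    // its size is discovered during the computation, not before.
    std::unordered_map<std::int64_t, double> Z;
    for (const auto& x : X) {
        auto it = yByK.find(x.k);
        if (it == yByK.end()) continue;  // no matching k in Y
        for (const auto& y : it->second)
            Z[static_cast<std::int64_t>(x.i) * nCols + y.j] += x.v * y.v;
    }
    return Z;
}
```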
EGEMM-TC: accelerating scientific computing on tensor cores with extended precision
Boyuan Feng, Yuke Wang, Guoyang Chen, Weifeng Zhang, Yuan Xie, Yufei Ding
DOI: https://doi.org/10.1145/3437801.3441599 · Published: 2021-02-17

Abstract: Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores, especially in the area of scientific computing with high precision requirements. In this paper, we build Emulated GEMM on Tensor Cores (EGEMM-TC) to extend the usage of Tensor Cores to accelerate scientific computing applications without compromising the precision requirements. First, EGEMM-TC employs an extendable workflow of hardware profiling and operation design to generate a lightweight emulation algorithm on Tensor Cores with extended precision. Second, EGEMM-TC exploits a set of Tensor Core kernel optimizations to achieve high performance, including the highly-efficient tensorization to exploit the Tensor Core memory architecture and the instruction-level optimizations to coordinate the emulation computation and memory access. Third, EGEMM-TC incorporates a hardware-aware analytic model to offer large flexibility for automatic performance tuning across various scientific computing workloads and input datasets. Extensive evaluations show that EGEMM-TC can achieve on average 3.13× and 11.18× speedup over the cuBLAS kernels and the CUDA-SDK kernels on CUDA Cores, respectively. Our case study on several scientific computing applications further confirms that EGEMM-TC can generalize the usage of Tensor Cores and achieve about 1.8× speedup compared to the hand-tuned, highly-optimized implementations running on CUDA Cores.
Citations: 27
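The general splitting technique behind such emulation can be illustrated in portable scalar code. The sketch below uses double → float splitting rather than float → half so it compiles anywhere, but the principle matches what the abstract describes: split each operand into a high part representable in the low-precision format plus a residual, and recover accuracy from the cross products accumulated in the wider format. EGEMM-TC's actual kernels operate on half-precision tiles with tensor-core MMA instructions.

```cpp
#include <cstdio>

int main() {
    double a = 1.0 / 3.0, b = 1.0 / 7.0;

    // Split each operand into a representable high part plus a residual.
    float aHi = static_cast<float>(a);
    float aLo = static_cast<float>(a - aHi);  // what rounding discarded
    float bHi = static_cast<float>(b);
    float bLo = static_cast<float>(b - bHi);

    // Accumulate the partial products in the wider format, as tensor
    // cores' fp32 accumulators do for fp16 inputs; aLo*bLo is negligible.
    double emulated = static_cast<double>(aHi) * bHi
                    + static_cast<double>(aHi) * bLo
                    + static_cast<double>(aLo) * bHi;

    std::printf("naive low-precision: %.17g\n", static_cast<double>(aHi * bHi));
    std::printf("emulated:            %.17g\n", emulated);
    std::printf("exact:               %.17g\n", a * b);
}
```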
Modernizing parallel code with pattern analysis
Roberto Castañeda Lozano, M. Cole, Björn Franke
DOI: https://doi.org/10.1145/3437801.3441603 · Published: 2021-02-17

Abstract: Fifty years of parallel programming has generated a substantial legacy parallel codebase, creating a new portability challenge: re-parallelizing already parallel code. Our solution exploits inherently portable parallel patterns, and addresses the challenge of identifying patternization opportunities in legacy parallel code via constraint matching on traced dynamic dataflow graphs. Notably, this makes the analysis source-independent and equally applicable to sequential and parallel legacy code. We identify various map and reduction patterns, including compositions, in Pthreads code. Experiments with the Starbench suite show that our analysis is effective (finding 86% of the patterns known in the literature), accurate (reporting actual patterns in 98% of the cases), and efficient (scaling linearly with the size of the execution traces). We re-express the found patterns via a parallel pattern library, making code freely portable across CPU/GPU systems and performing competitively with hand-tuned implementations at zero additional effort.
Citations: 1
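As a flavor of what patternization buys (using the C++17 standard parallel algorithms here purely for illustration; the paper targets Pthreads code and re-expresses patterns through its own pattern library), a loop recognized as a map followed by a reduction can be rewritten portably instead of via hand-managed threads:

```cpp
#include <algorithm>
#include <execution>  // C++17 parallel algorithms (link with TBB under libstdc++)
#include <numeric>
#include <vector>

// Map + reduction expressed as portable patterns rather than explicit
// thread creation, partitioning, and joining.
double sumOfSquares(const std::vector<double>& in) {
    std::vector<double> squared(in.size());
    // map: independent per-element work
    std::transform(std::execution::par_unseq, in.begin(), in.end(),
                   squared.begin(), [](double x) { return x * x; });
    // reduction: associative combine
    return std::reduce(std::execution::par_unseq,
                       squared.begin(), squared.end(), 0.0);
}
```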