Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: latest papers (page 4)

OrcGC

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441596

Andreia Correia, P. Ramalhete, P. Felber

Abstract: Dynamic lock-free data structures require a memory reclamation scheme with a similar progress. Until today, lock-free schemes are applied to data structures on a case-by-case basis, often with algorithm modifications to the data structure. In this paper we introduce two new lock-free reclamation schemes, one manual and the other automatic with user annotated types. The manual reclamation scheme, named pass-the-pointer (PTP), has lock-free progress and a bound on the number of unreclaimed objects that is linear with the number of threads. The automatic lock-free memory reclamation scheme, which we named OrcGC, uses PTP and object reference counting to automatically detect when to protect and when to de-allocate an object. OrcGC has a linear bound on memory usage and can be used with any allocator. We propose a new methodology that utilizes OrcGC to provide lock-free memory reclamation to a data structure. We conducted a performance evaluation on two machines, an Intel and an AMD, applying PTP and OrcGC to several lock-free data structures, providing lock-free memory reclamation where before there was none. On the Intel machine we saw no significant performance impact, while on AMD we observed a worst-case performance drop below 50%.

Citations: 13

Asynchrony versus bulk-synchrony for a generalized N-body problem from genomics

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441580

Marquita Ellis, A. Buluç, K. Yelick

Abstract: This work examines a data-intensive irregular application from genomics, a long-read to long-read alignment problem, which represents a kind of Generalized N-Body problem, one of the "seven giants" of the NRC Big Data motifs [5]. In this problem, computations (genomic alignments) are performed on sparse and data-dependent pairs of inputs, with variable cost computation and variable datum sizes. In particular, there is no inherent locality in the pairwise interactions, unlike simulation-based N-Body problems, and the interaction sparsity depends on particular parameters of the input, which can also affect the quality of the output. We examine two extremes of distributed memory parallelization for this problem, bulk-synchrony and asynchrony, with real workloads. Our bulk-synchronous implementation uses collective communication in MPI, while our asynchronous implementation uses cross-node RPCs in UPC++. We show that the asynchronous version effectively hides communication costs, with a memory footprint that is typically much lower than the bulk-synchronous version. Our application, while simple enough to be a kind of proxy for genomics or data analytics applications more broadly, is also part of a real application pipeline. It shows good scaling on real input problems, and at the same time, reveals some of the programming and architectural challenges for scaling this type of data-intensive irregular application.

Citations: 0

Investigating the semantics of futures in transactional memory systems

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441594

Jingna Zeng, S. Issa, P. Romano, L. Rodrigues, Seif Haridi

Citations: 2

In-situ workflow auto-tuning through combining component models

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441615

Tong Shu, Yanfei Guo, J. Wozniak, Xiaoning Ding, Ian T Foster, T. Kurç

Citations: 4

An ownership policy and deadlock detector for promises

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-01-05 DOI: 10.1145/3437801.3441616

Caleb Voss, Vivek Sarkar

Abstract: Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoiding policies such as Known Joins or Transitive Joins. However, the promise, a popular synchronization primitive for parallel tasks, does not enjoy deadlock-freedom guarantees. Promises can exhibit deadlock-like bugs; however, the concept of a deadlock is not currently well-defined for promises. To address these challenges, we propose an ownership semantics in which each promise is associated to the task which currently intends to fulfill it. Ownership immediately enables the identification of bugs in which a task fails to fulfill a promise for which it is responsible. Ownership further enables the discussion of deadlock cycles among tasks and promises and allows us to introduce a robust definition of deadlock-like bugs for promises. Cycle detection in this context is non-trivial because it is concurrent with changes in promise ownership. We provide a lock-free algorithm for precise runtime deadlock detection. We show how to obtain the memory consistency criteria required for the correctness of our algorithm under TSO and the Java and C++ memory models. An evaluation compares the execution time and memory usage overheads of our detection algorithm on benchmark programs relative to an unverified baseline. Our detector exhibits a 12% (1.12×) geometric mean time overhead and a 6% (1.06×) geometric mean memory overhead, which are smaller overheads than in past approaches to deadlock cycle detection.
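The ownership idea in this abstract can be illustrated with a small sequential sketch: if every promise has one owning task and every blocked task waits on one promise, a deadlock cycle is a chain task → awaited promise → owner → ... that returns to the starting task. This is only an illustration under those assumptions, not the paper's lock-free algorithm; it ignores concurrent ownership changes, and the function name and the two dictionaries (`waiting_on`, `owner`) are hypothetical.

```python
from typing import Dict, List, Optional


def find_deadlock_cycle(
    start_task: str,
    waiting_on: Dict[str, str],  # task -> promise it is blocked on (hypothetical)
    owner: Dict[str, str],       # promise -> task expected to fulfill it (hypothetical)
) -> Optional[List[str]]:
    """Walk task -> awaited promise -> owning task, starting at start_task.

    Returns the alternating task/promise chain if the walk comes back to
    start_task, otherwise None. Sequential sketch: it assumes ownership
    does not change while the walk is running.
    """
    chain: List[str] = []
    seen = set()
    task = start_task
    while task in waiting_on and task not in seen:
        seen.add(task)
        promise = waiting_on[task]
        chain.extend([task, promise])
        next_task = owner.get(promise)
        if next_task is None:
            return None  # promise has no recorded owner: no cycle through it
        if next_task == start_task:
            return chain  # the chain closed on the starting task
        task = next_task
    return None  # chain ended at a running task or looped elsewhere


# t1 waits on p2 (owned by t2) while t2 waits on p1 (owned by t1).
cycle = find_deadlock_cycle(
    "t1",
    waiting_on={"t1": "p2", "t2": "p1"},
    owner={"p1": "t1", "p2": "t2"},
)
print(cycle)
```

Checking at the moment a task is about to block, as the sketch does, keeps each walk short; making the same walk safe while ownership is being transferred concurrently is the harder part that the abstract describes as non-trivial.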

Citations: 2

Extracting clean performance models from tainted programs

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2020-12-31 DOI: 10.1145/3437801.3441613

Marcin Copik, A. Calotoiu, T. Grosser, Nicolas Wicki, F. Wolf, T. Hoefler

Citations: 8

Bundled references: an abstraction for highly-concurrent linearizable range queries

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2020-12-31 DOI: 10.1145/3437801.3441614

J. Nelson, A. Hassan, R. Palmieri

Citations: 5

I/O lower bounds for auto-tuning of convolutions in CNNs

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2020-12-31 DOI: 10.1145/3437801.3441609

Xiaoyang Zhang, Junmin Xiao, Guangming Tan

Abstract: Convolution is the most time-consuming part in the computation of convolutional neural networks (CNNs), which have achieved great successes in numerous practical applications. Due to the complex data dependency and the increase in the amount of model samples, the convolution suffers from high overhead on data movement (i.e., memory access). This work provides comprehensive analysis and methodologies to minimize the communication for the convolution in CNNs. With an in-depth analysis of the recent I/O complexity theory under the red-blue game model, we develop a general I/O lower bound theory for a composite algorithm which consists of several different sub-computations. Based on the proposed theory, we establish the data movement lower bound results for two main convolution algorithms in CNNs, namely the direct convolution and Winograd algorithm, which represent the direct and indirect implementations of a convolution respectively. Next, derived from I/O lower bound results, we design the near I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting the data reuse. Furthermore, in order to push the envelope of performance of the near I/O-optimal dataflow strategies further, an aggressive auto-tuning design based on I/O lower bounds is proposed to search for an optimal parameter configuration for the direct convolution and Winograd algorithm on GPU, such as the number of threads and the size of shared memory used in each thread block. Finally, experimental evaluation results on the direct convolution and Winograd algorithm show that our dataflow strategies with the auto-tuning approach can achieve about 3.32× performance speedup on average over cuDNN. In addition, compared with TVM, which represents the state-of-the-art technique for auto-tuning, our auto-tuning method based on I/O lower bounds not only finds the optimal parameter configuration faster, but also yields higher performance than the optimal solution provided by TVM.

Citations: 5

NBR: neutralization based reclamation

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2020-12-29 DOI: 10.1145/3437801.3441625

Ajay Singh, Trevor Brown, A. Mashtizadeh

Abstract: Safe memory reclamation (SMR) algorithms suffer from a trade-off between bounding unreclaimed memory and the speed of reclamation. Hazard pointer (HP) based algorithms bound unreclaimed memory at all times, but tend to be slower than other approaches. Epoch based reclamation (EBR) algorithms are faster, but do not bound memory reclamation. Other algorithms follow hybrid approaches, requiring special compiler or hardware support, changes to record layouts, and/or extensive code changes. Not all SMR algorithms can be used to reclaim memory for all data structures. We propose a new neutralization based reclamation (NBR) algorithm that is often faster than the best known EBR algorithms and achieves bounded unreclaimed memory. It is non-blocking when used with a non-blocking operating system (OS) kernel, and only requires atomic read, write and CAS. NBR is straightforward to use with many different data structures, and in most cases, requires similar reasoning and programmer effort to two-phased locking. NBR is implemented using OS signals and a lightweight handshaking mechanism between participating threads to determine when it is safe to reclaim a record. Experiments on a lock-based binary search tree and a lazy linked list show that NBR significantly outperforms many state-of-the-art reclamation algorithms. In the tree, NBR is faster than the next best algorithm, DEBRA, by up to 38% and HP by up to 17%. And, in the list, NBR is 15% and 243% faster than DEBRA and HP, respectively.
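The trade-off named in this abstract, that epoch-based schemes are fast but do not bound unreclaimed memory, can be seen in a small single-threaded simulation of classic EBR. This is not NBR and not the authors' code; it assumes the commonly used rule that a record retired in epoch e may be freed once the global epoch reaches e + 2, and the class and method names are hypothetical.

```python
from typing import Any, List, Optional, Tuple


class EpochReclaimer:
    """Sequential toy model of epoch-based reclamation (hypothetical API)."""

    def __init__(self, n_threads: int) -> None:
        self.global_epoch = 0
        # Epoch each thread announced on entry; None means it is quiescent.
        self.announced: List[Optional[int]] = [None] * n_threads
        self.limbo: List[Tuple[int, Any]] = []  # (retire epoch, record)
        self.freed: List[Any] = []

    def enter(self, tid: int) -> None:
        """Thread tid starts an operation and announces the current epoch."""
        self.announced[tid] = self.global_epoch

    def leave(self, tid: int) -> None:
        """Thread tid finishes its operation and becomes quiescent."""
        self.announced[tid] = None

    def retire(self, record: Any) -> None:
        """Unlink a record; it cannot be freed yet because readers may hold it."""
        self.limbo.append((self.global_epoch, record))

    def try_advance(self) -> None:
        """Advance the epoch if no active thread lags, then free old records."""
        if all(e is None or e == self.global_epoch for e in self.announced):
            self.global_epoch += 1
        still_waiting = []
        for epoch, record in self.limbo:
            if epoch <= self.global_epoch - 2:
                self.freed.append(record)
            else:
                still_waiting.append((epoch, record))
        self.limbo = still_waiting


r = EpochReclaimer(n_threads=2)
r.enter(0)                 # thread 0 starts an operation, then stalls
for i in range(5):
    r.retire(f"node{i}")   # records pile up in limbo
    r.try_advance()
print(len(r.limbo), r.freed)   # the stalled thread blocks all reclamation
r.leave(0)
r.try_advance()
r.try_advance()
print(len(r.limbo), r.freed)   # everything is freed once thread 0 leaves
```

While thread 0 stays inside its operation, the limbo list grows with every retire call, which is the unbounded behavior that hazard pointers avoid and that NBR is reported to bound.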

Citations: 13

Verifying C11-style weak memory libraries

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2020-12-28 DOI: 10.1145/3437801.3441619

Sadegh Dalvandi, Brijesh Dongol

Citations: 4