ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming最新文献

筛选
英文 中文
In-place transposition of rectangular matrices on accelerators 矩形矩阵在加速器上的原位移位
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555266
I-Jui Sung, Juan Gómez-Luna, José María González-Linares, Nicolás Guil Mata, Wen-mei W. Hwu
{"title":"In-place transposition of rectangular matrices on accelerators","authors":"I-Jui Sung, Juan Gómez-Luna, José María González-Linares, Nicolás Guil Mata, Wen-mei W. Hwu","doi":"10.1145/2555243.2555266","DOIUrl":"https://doi.org/10.1145/2555243.2555266","url":null,"abstract":"Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. It has also been used to convert the storage layout of arrays. With more and more algebra libraries offloaded to GPUs, a high performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the amount of parallelism and locality required by GPUs to achieve good performance. In this paper we present the first known in-place matrix transposition approach for the GPUs. Our implementation is based on a novel 3-stage transposition algorithm where each stage is performed using an elementary tiled-wise transposition. Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows hiding transposition overhead by overlap with PCIe transfer. We show that the 3-stage algorithm allows larger tiles and achieves 3X speedup over a traditional 4-stage algorithm, with both algorithms based on our high-performance elementary transpositions on the GPU. We also show our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to GPU, achieving a throughput of more than 3.4 GB/s (including data transfers costs), and improving current multithreaded implementations of in-place transposition on CPU.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124372860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 24
Resilient X10: efficient failure-aware programming 弹性X10:高效的故障感知编程
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555248
D. Cunningham, D. Grove, Benjamin Herta, A. Iyengar, K. Kawachiya, H. Murata, V. Saraswat, Mikio Takeuchi, O. Tardieu
{"title":"Resilient X10: efficient failure-aware programming","authors":"D. Cunningham, D. Grove, Benjamin Herta, A. Iyengar, K. Kawachiya, H. Murata, V. Saraswat, Mikio Takeuchi, O. Tardieu","doi":"10.1145/2555243.2555248","DOIUrl":"https://doi.org/10.1145/2555243.2555248","url":null,"abstract":"Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of Map Reduce, Resilient Data Sets and MillWheel has shown dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically.\u0000 We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a {em Happens Before Invariance Principle} and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show this reduces much of the burden of resilient programming. The programmer is only responsible for continuing execution with fewer computational resources and the loss of part of the heap, and can do so while taking advantage of domain knowledge.\u0000 We build a complete implementation of the language, capable of executing benchmark applications on hundreds of nodes. We describe the algorithms required to make the language runtime resilient. We then give three applications, each with a different approach to fault tolerance (replay, decimation, and domain-level checkpointing). These can be executed at scale and survive node failure. We show that for these programs the overhead of resilience is a small fraction of overall runtime by comparing to equivalent non-resilient X10 programs. On one program we show end-to-end performance of Resilient X10 is ~100x faster than Hadoop.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123222853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 60
A general technique for non-blocking trees 非阻塞树的一般技术
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555267
Trevor Brown, Faith Ellen, E. Ruppert
{"title":"A general technique for non-blocking trees","authors":"Trevor Brown, Faith Ellen, E. Ruppert","doi":"10.1145/2555243.2555267","DOIUrl":"https://doi.org/10.1145/2555243.2555267","url":null,"abstract":"We describe a general technique for obtaining provably correct, non-blocking implementations of a large class of tree data structures where pointers are directed from parents to children. Updates are permitted to modify any contiguous portion of the tree atomically. Our non-blocking algorithms make use of the LLX, SCX and VLX primitives, which are multi-word generalizations of the standard LL, SC and VL primitives and have been implemented from single-word CAS. To illustrate our technique, we describe how it can be used in a fairly straightforward way to obtain a non-blocking implementation of a chromatic tree, which is a relaxed variant of a red-black tree. The height of the tree at any time is O(c + log n), where n is the number of keys and c is the number of updates in progress. We provide an experimental performance analysis which demonstrates that our Java implementation of a chromatic tree rivals, and often significantly outperforms, other leading concurrent dictionaries.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127142148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 103
Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing Triolet:为高性能集群计算统一算法骨架接口的编程系统
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555268
Christopher I. Rodrigues, T. Jablin, Abdul Dakkak, Wen-mei W. Hwu
{"title":"Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing","authors":"Christopher I. Rodrigues, T. Jablin, Abdul Dakkak, Wen-mei W. Hwu","doi":"10.1145/2555243.2555268","DOIUrl":"https://doi.org/10.1145/2555243.2555268","url":null,"abstract":"Functional algorithmic skeletons promise a high-level programming interface for distributed-memory clusters that free developers from concerns of task decomposition, scheduling, and communication. Unfortunately, prior distributed functional skeleton frameworks do not deliver performance comparable to that achievable in a low-level distributed programming model such as C with MPI and OpenMP, even when used in concert with high-performance array libraries. There are several causes: they do not take advantage of shared memory on each cluster node; they impose a fixed partitioning strategy on input data; and they have limited ability to fuse loops involving skeletons that produce a variable number of outputs per input.\u0000 We address these shortcomings in the Triolet programming language through a modular library design that separates concerns of parallelism, loop nesting, and data partitioning. We show how Triolet substantially improves the parallel performance of algorithms involving array traversals and nested, variable-size loops over what is achievable in Eden, a distributed variant of Haskell. We further demonstrate how Triolet can substantially simplify parallel programming relative to C with MPI and OpenMP while achieving 23--100% of its performance on a 128-core cluster.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127009413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
Revisiting loop fusion in the polyhedral framework 重述多面体框架中的环融合
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555250
Sanyam Mehta, P. Lin, P. Yew
{"title":"Revisiting loop fusion in the polyhedral framework","authors":"Sanyam Mehta, P. Lin, P. Yew","doi":"10.1145/2555243.2555250","DOIUrl":"https://doi.org/10.1145/2555243.2555250","url":null,"abstract":"Loop fusion is an important compiler optimization for improving memory hierarchy performance through enabling data reuse. Traditional compilers have approached loop fusion in a manner decoupled from other high-level loop optimizations, missing several interesting solutions. Recently, the polyhedral compiler framework with its ability to compose complex transformations, has proved to be promising in performing loop optimizations for small programs. However, our experiments with large programs using state-of-the-art polyhedral compiler frameworks reveal suboptimal fusion partitions in the transformed code. We trace the reason for this to be lack of an effective cost model to choose a good fusion partitioning among the possible choices, which increase exponentially with the number of program statements. In this paper, we propose a fusion algorithm to choose good fusion partitions with two objective functions - achieving good data reuse and preserving parallelism inherent in the source code. These objectives, although targeted by previous work in traditional compilers, pose new challenges within the polyhedral compiler framework and have thus not been addressed. In our algorithm, we propose several heuristics that work effectively within the polyhedral compiler framework and allow us to achieve the proposed objectives. Experimental results show that our fusion algorithm achieves performance comparable to the existing polyhedral compilers for small kernel programs, and significantly outperforms them for large benchmark programs such as those in the SPEC benchmark suite.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114084355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 27
Singe: leveraging warp specialization for high performance on GPUs single:利用warp专门化在gpu上实现高性能
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555258
Michael A. Bauer, Sean Treichler, A. Aiken
{"title":"Singe: leveraging warp specialization for high performance on GPUs","authors":"Michael A. Bauer, Sean Treichler, A. Aiken","doi":"10.1145/2555243.2555258","DOIUrl":"https://doi.org/10.1145/2555243.2555258","url":null,"abstract":"We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named barriers. Partitioning computations using warp specialization allows Singe to deal efficiently with the irregularity in both data access patterns and computation. Furthermore, warp-specialized partitioning of computations allows Singe to fit extremely large working sets into on-chip memories. Finally, we describe the architecture and general compilation techniques necessary for constructing a warp-specializing compiler. We show that the warp-specialized code emitted by Singe is up to 3.75X faster than previously optimized data-parallel GPU kernels.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115659269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 44
Fast concurrent lock-free binary search trees 快速并发无锁二叉搜索树
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555256
Aravind Natarajan, N. Mittal
{"title":"Fast concurrent lock-free binary search trees","authors":"Aravind Natarajan, N. Mittal","doi":"10.1145/2555243.2555256","DOIUrl":"https://doi.org/10.1145/2555243.2555256","url":null,"abstract":"We present a new lock-free algorithm for concurrent manipulation of a binary search tree in an asynchronous shared memory system that supports search, insert and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (SETB) atomic instructions, both of which are commonly supported by many modern processors including Intel~64 and AMD64.\u0000 In contrast to existing lock-free algorithms for a binary search tree, our algorithm is based on marking edges rather than nodes. As a result, when compared to other lock-free algorithms, modify (insert and delete) operations in our algorithm work on a smaller portion of the tree, thereby reducing conflicts, and execute fewer atomic instructions (one for insert and three for delete). Our experiments indicate that our lock-free algorithm significantly outperforms all other algorithms for a concurrent binary search tree in many cases, especially when contention is high, by as much as 100%.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115062290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 178
Automatic semantic locking 自动语义锁定
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555281
Guy Golan-Gueta, G. Ramalingam, Shmuel Sagiv, Eran Yahav
{"title":"Automatic semantic locking","authors":"Guy Golan-Gueta, G. Ramalingam, Shmuel Sagiv, Eran Yahav","doi":"10.1145/2555243.2555281","DOIUrl":"https://doi.org/10.1145/2555243.2555281","url":null,"abstract":"In this paper, we consider concurrent programs in which the shared state consists of instances of linearizable ADTs (abstract data types). We develop a novel automated approach to concurrency control that addresses a common need: the need to atomically execute a code fragment, which may contain multiple ADT operations on multiple ADT instances. In our approach, each ADT implements ADT-specific semantic locking operations that serve to exploit the semantics of ADT operations. We develop a synthesis algorithm that automatically inserts calls to these locking operations in a set of given code fragments (in a client program) to ensure that these code fragments execute atomically without deadlocks, and without rollbacks.\u0000 We have implemented the synthesis algorithm and several general-purpose ADTs with semantic locking. We have applied the synthesis algorithm to several Java programs that use these ADTs. Our results show that our approach enables efficient and scalable synchronization.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124252466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Time-warp: lightweight abort minimization in transactional memory 时间扭曲:事务内存中的轻量级中断最小化
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555259
Nuno Diegues, P. Romano
{"title":"Time-warp: lightweight abort minimization in transactional memory","authors":"Nuno Diegues, P. Romano","doi":"10.1145/2555243.2555259","DOIUrl":"https://doi.org/10.1145/2555243.2555259","url":null,"abstract":"The notion of permissiveness in Transactional Memory (TM) translates to only aborting a transaction when it cannot be accepted in any history that guarantees correctness criterion. This property is neglected by most TMs, which, in order to maximize implementation's efficiency, resort to aborting transactions under overly conservative conditions. In this paper we seek to identify a sweet spot between permissiveness and efficiency by introducing the Time-Warp Multi-version algorithm (TWM). TWM is based on the key idea of allowing an update transaction that has performed stale reads (i.e., missed the writes of concurrently committed transactions) to be serialized by committing it in the past, which we call a time-warp commit. At its core, TWM uses a novel, lightweight validation mechanism with little computational overheads. TWM also guarantees that read-only transactions can never be aborted. Further, TWM guarantees Virtual World Consistency, a safety property that is deemed as particularly relevant in the context of TM. We demonstrate the practicality of this approach through an extensive experimental study, where we compare TWM with four other TMs, and show an average performance improvement of 65% in high concurrency scenarios.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126352866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
Lock contention aware thread migrations 锁定感知争用的线程迁移
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555273
K. Pusukuri, Rajiv Gupta, L. Bhuyan
{"title":"Lock contention aware thread migrations","authors":"K. Pusukuri, Rajiv Gupta, L. Bhuyan","doi":"10.1145/2555243.2555273","DOIUrl":"https://doi.org/10.1145/2555243.2555273","url":null,"abstract":"On a cache-coherent multicore multiprocessor system, the performance of a multithreaded application with high lock contention is very sensitive to the distribution of application threads across multiple processors. This is because the distribution of threads impacts the frequency of lock transfers between processors, which in turn impacts the frequency of last-level cache (LLC) misses that lie on the critical path of execution. Inappropriate distribution of threads across processors increases LLC misses in the critical path and significantly degrades performance of multithreaded programs. To alleviate the above problem, this paper overviews a thread migration technique, which migrates threads of a multithreaded program across multicore processors so that threads seeking locks are more likely to find the locks on the same processor.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131911015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信