ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming最新文献_第7页

In-place transposition of rectangular matrices on accelerators 矩形矩阵在加速器上的原位移位

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555266

I-Jui Sung, Juan Gómez-Luna, José María González-Linares, Nicolás Guil Mata, Wen-mei W. Hwu

{"title":"In-place transposition of rectangular matrices on accelerators","authors":"I-Jui Sung, Juan Gómez-Luna, José María González-Linares, Nicolás Guil Mata, Wen-mei W. Hwu","doi":"10.1145/2555243.2555266","DOIUrl":"https://doi.org/10.1145/2555243.2555266","url":null,"abstract":"Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. It has also been used to convert the storage layout of arrays. With more and more algebra libraries offloaded to GPUs, a high performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the amount of parallelism and locality required by GPUs to achieve good performance. In this paper we present the first known in-place matrix transposition approach for the GPUs. Our implementation is based on a novel 3-stage transposition algorithm where each stage is performed using an elementary tiled-wise transposition. Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows hiding transposition overhead by overlap with PCIe transfer. We show that the 3-stage algorithm allows larger tiles and achieves 3X speedup over a traditional 4-stage algorithm, with both algorithms based on our high-performance elementary transpositions on the GPU. We also show our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to GPU, achieving a throughput of more than 3.4 GB/s (including data transfers costs), and improving current multithreaded implementations of in-place transposition on CPU.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124372860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 24

Resilient X10: efficient failure-aware programming 弹性X10:高效的故障感知编程

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555248

D. Cunningham, D. Grove, Benjamin Herta, A. Iyengar, K. Kawachiya, H. Murata, V. Saraswat, Mikio Takeuchi, O. Tardieu

{"title":"Resilient X10: efficient failure-aware programming","authors":"D. Cunningham, D. Grove, Benjamin Herta, A. Iyengar, K. Kawachiya, H. Murata, V. Saraswat, Mikio Takeuchi, O. Tardieu","doi":"10.1145/2555243.2555248","DOIUrl":"https://doi.org/10.1145/2555243.2555248","url":null,"abstract":"Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of Map Reduce, Resilient Data Sets and MillWheel has shown dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically.\u0000 We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a {em Happens Before Invariance Principle} and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show this reduces much of the burden of resilient programming. The programmer is only responsible for continuing execution with fewer computational resources and the loss of part of the heap, and can do so while taking advantage of domain knowledge.\u0000 We build a complete implementation of the language, capable of executing benchmark applications on hundreds of nodes. We describe the algorithms required to make the language runtime resilient. We then give three applications, each with a different approach to fault tolerance (replay, decimation, and domain-level checkpointing). These can be executed at scale and survive node failure. We show that for these programs the overhead of resilience is a small fraction of overall runtime by comparing to equivalent non-resilient X10 programs. On one program we show end-to-end performance of Resilient X10 is ~100x faster than Hadoop.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123222853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 60

A general technique for non-blocking trees 非阻塞树的一般技术

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555267

Trevor Brown, Faith Ellen, E. Ruppert

引用次数: 103

Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing Triolet：为高性能集群计算统一算法骨架接口的编程系统

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555268

Christopher I. Rodrigues, T. Jablin, Abdul Dakkak, Wen-mei W. Hwu

{"title":"Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing","authors":"Christopher I. Rodrigues, T. Jablin, Abdul Dakkak, Wen-mei W. Hwu","doi":"10.1145/2555243.2555268","DOIUrl":"https://doi.org/10.1145/2555243.2555268","url":null,"abstract":"Functional algorithmic skeletons promise a high-level programming interface for distributed-memory clusters that free developers from concerns of task decomposition, scheduling, and communication. Unfortunately, prior distributed functional skeleton frameworks do not deliver performance comparable to that achievable in a low-level distributed programming model such as C with MPI and OpenMP, even when used in concert with high-performance array libraries. There are several causes: they do not take advantage of shared memory on each cluster node; they impose a fixed partitioning strategy on input data; and they have limited ability to fuse loops involving skeletons that produce a variable number of outputs per input.\u0000 We address these shortcomings in the Triolet programming language through a modular library design that separates concerns of parallelism, loop nesting, and data partitioning. We show how Triolet substantially improves the parallel performance of algorithms involving array traversals and nested, variable-size loops over what is achievable in Eden, a distributed variant of Haskell. We further demonstrate how Triolet can substantially simplify parallel programming relative to C with MPI and OpenMP while achieving 23--100% of its performance on a 128-core cluster.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127009413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Revisiting loop fusion in the polyhedral framework 重述多面体框架中的环融合

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555250

Sanyam Mehta, P. Lin, P. Yew

{"title":"Revisiting loop fusion in the polyhedral framework","authors":"Sanyam Mehta, P. Lin, P. Yew","doi":"10.1145/2555243.2555250","DOIUrl":"https://doi.org/10.1145/2555243.2555250","url":null,"abstract":"Loop fusion is an important compiler optimization for improving memory hierarchy performance through enabling data reuse. Traditional compilers have approached loop fusion in a manner decoupled from other high-level loop optimizations, missing several interesting solutions. Recently, the polyhedral compiler framework with its ability to compose complex transformations, has proved to be promising in performing loop optimizations for small programs. However, our experiments with large programs using state-of-the-art polyhedral compiler frameworks reveal suboptimal fusion partitions in the transformed code. We trace the reason for this to be lack of an effective cost model to choose a good fusion partitioning among the possible choices, which increase exponentially with the number of program statements. In this paper, we propose a fusion algorithm to choose good fusion partitions with two objective functions - achieving good data reuse and preserving parallelism inherent in the source code. These objectives, although targeted by previous work in traditional compilers, pose new challenges within the polyhedral compiler framework and have thus not been addressed. In our algorithm, we propose several heuristics that work effectively within the polyhedral compiler framework and allow us to achieve the proposed objectives. Experimental results show that our fusion algorithm achieves performance comparable to the existing polyhedral compilers for small kernel programs, and significantly outperforms them for large benchmark programs such as those in the SPEC benchmark suite.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114084355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 27

Singe: leveraging warp specialization for high performance on GPUs single:利用warp专门化在gpu上实现高性能

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555258

Michael A. Bauer, Sean Treichler, A. Aiken

引用次数: 44

Fast concurrent lock-free binary search trees 快速并发无锁二叉搜索树

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555256

Aravind Natarajan, N. Mittal

引用次数: 178

Automatic semantic locking 自动语义锁定

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555281

Guy Golan-Gueta, G. Ramalingam, Shmuel Sagiv, Eran Yahav

引用次数: 4

Time-warp: lightweight abort minimization in transactional memory 时间扭曲:事务内存中的轻量级中断最小化

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555259

Nuno Diegues, P. Romano

{"title":"Time-warp: lightweight abort minimization in transactional memory","authors":"Nuno Diegues, P. Romano","doi":"10.1145/2555243.2555259","DOIUrl":"https://doi.org/10.1145/2555243.2555259","url":null,"abstract":"The notion of permissiveness in Transactional Memory (TM) translates to only aborting a transaction when it cannot be accepted in any history that guarantees correctness criterion. This property is neglected by most TMs, which, in order to maximize implementation's efficiency, resort to aborting transactions under overly conservative conditions. In this paper we seek to identify a sweet spot between permissiveness and efficiency by introducing the Time-Warp Multi-version algorithm (TWM). TWM is based on the key idea of allowing an update transaction that has performed stale reads (i.e., missed the writes of concurrently committed transactions) to be serialized by committing it in the past, which we call a time-warp commit. At its core, TWM uses a novel, lightweight validation mechanism with little computational overheads. TWM also guarantees that read-only transactions can never be aborted. Further, TWM guarantees Virtual World Consistency, a safety property that is deemed as particularly relevant in the context of TM. We demonstrate the practicality of this approach through an extensive experimental study, where we compare TWM with four other TMs, and show an average performance improvement of 65% in high concurrency scenarios.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126352866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 26

Lock contention aware thread migrations 锁定感知争用的线程迁移

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555273

K. Pusukuri, Rajiv Gupta, L. Bhuyan

引用次数: 2