ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming: Latest Papers (Page 8)

A decomposition for in-place matrix transposition

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555253

Bryan Catanzaro, A. Keller, M. Garland

Abstract: We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s. Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses. In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.

Citations: 40
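
The abstract above contrasts the authors' decomposition with traditional cycle following. As a point of reference, here is a minimal Python sketch of that traditional approach only, not of the paper's decomposition. It assumes a row-major m by n matrix stored in a flat list and uses no auxiliary bitmap: each cycle is identified by re-walking it to check whether the starting index is its smallest member, which is where the extra work comes from. The function name `transpose_in_place` and the helper `dest` are illustrative choices, not names from the paper.

```python
def transpose_in_place(a, m, n):
    """Transpose a row-major m x n matrix stored in the flat list `a`, in place.

    Classic cycle following with O(1) auxiliary space (illustrative sketch,
    not the decomposition described in the paper).
    """
    size = m * n
    if len(a) != size:
        raise ValueError("len(a) must equal m * n")
    last = size - 1  # indices 0 and size - 1 never move

    def dest(k):
        # Element (i, j) at index k = i*n + j moves to index j*m + i,
        # which equals (k * m) mod (m*n - 1) for 0 < k < m*n - 1.
        return (k * m) % last

    for start in range(1, last):
        # Walk the cycle to test whether `start` is its smallest index.
        k = dest(start)
        while k > start:
            k = dest(k)
        if k < start:
            continue  # this cycle was already rotated from a smaller index

        # `start` leads its cycle: rotate the cycle's elements one step.
        carried = a[start]
        k = dest(start)
        while k != start:
            a[k], carried = carried, a[k]
            k = dest(k)
        a[start] = carried


matrix = [1, 2, 3,
          4, 5, 6]                  # 2 x 3, row-major
transpose_in_place(matrix, 2, 3)
print(matrix)                       # [1, 4, 2, 5, 3, 6], i.e. 3 x 2 row-major
```

The cycles have irregular lengths and each leader test re-walks part of a cycle, which illustrates why the abstract calls this approach difficult to parallelize.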

A practical wait-free simulation for lock-free data structures

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555261

Shahar Timnat, E. Petrank

Citations: 64

Portable, MPI-interoperable Coarray Fortran

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555270

Chaoran Yang, Wesley Bland, J. Mellor-Crummey, P. Balaji

Citations: 12

Heterogeneous computing: what does it mean for compiler research?

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2692916.2558891

Norman Rubin

Abstract: The current trend in computer architecture is to increase the number of cores, to create specialized types of cores within a single machine, and to network such machines together in very fluid web/cloud computing arrangements. Compilers have traditionally focused on optimizations to code that improve performance, but is that the right target to speed up real applications? Consider loading a web page (like starting GMAIL): the page is transferred to the client, any JavaScript is compiled, the JavaScript executes, and the page gets displayed. The classic compiler model (which was first developed in the late 1950s) was a great fit for single-core machines but has fallen behind architecture and language. For example, how do you compile a single program for a machine that has both a CPU and a graphics coprocessor (a GPU) with a very different programming and memory model? Together with the changes in architecture there have been changes in programming languages. Dynamic languages are used more, static languages are used less. How does this affect compiler research? In this talk, I'll review a number of traditional compiler research challenges that have (or will) become burning issues and will describe some new problem areas that were not considered in the past. For example, language specifications are large, complex technical documents that are difficult for non-experts to follow. Application programmers are often not willing to read these documents; can a compiler bridge the gap?

Citations: 1

Theoretical analysis of classic algorithms on highly-threaded many-core GPUs

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555285

Lin Ma, Kunal Agrawal, R. Chamberlain

Citations: 9

Data structures for task-based priority scheduling

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-12-09 DOI: 10.1145/2555243.2555278

Martin Wimmer, F. Versaci, J. Träff, Daniel Cederman, P. Tsigas

Abstract: We present three lock-free data structures for priority task scheduling: a priority work-stealing one, a centralized one with ρ-relaxed semantics, and a hybrid one combining both concepts. With the single-source shortest path (SSSP) problem as an example, we show how the different approaches affect the prioritization and provide upper bounds on the number of examined nodes. We argue that priority task scheduling allows for an intuitive and easy way to parallelize the SSSP problem, notoriously a hard task. Experimental evidence supports the good scalability of the resulting algorithm. The larger aim of this work is to understand the trade-offs between scalability and priority guarantees in task scheduling systems. We show that ρ-relaxation is a valuable technique for improving the first, while still allowing semantic constraints to be satisfied: the lock-free, hybrid k-priority data structure can scale as well as work-stealing, while still providing strong priority scheduling guarantees, which depend on the parameter k. Our theoretical results open up possibilities for even more scalable data structures by adopting a weaker form of ρ-relaxation, which still enables the semantic constraints to be respected.

Citations: 30
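
To make the idea of SSSP as priority task scheduling more concrete, the following is a small sequential Python sketch. It is not any of the paper's lock-free data structures. The relaxation used here (each pop may return any of the `k` best pending tasks) is a hypothetical stand-in chosen for illustration and is not claimed to match the paper's ρ-relaxed or k-priority semantics. The names `sssp_relaxed`, `k`, and `seed` are assumptions as well.

```python
import heapq
import random


def sssp_relaxed(graph, source, k=1, seed=0):
    """Single-source shortest paths driven by a pool of (distance, node) tasks.

    `graph` maps a node to a list of (neighbor, non-negative weight) pairs.
    Hypothetical relaxation: each pop returns one of the k smallest pending
    tasks, chosen at random. k = 1 behaves like Dijkstra's algorithm.
    Returns (distances, number of examined nodes).
    """
    rng = random.Random(seed)
    dist = {source: 0}
    tasks = [(0, source)]          # pending tasks, kept as a plain list
    examined = 0

    while tasks:
        # Relaxed pop: any of the k best tasks may be taken.
        task = rng.choice(heapq.nsmallest(k, tasks))
        tasks.remove(task)
        d, u = task

        if d > dist[u]:
            continue               # stale task: a shorter path was found since
        examined += 1

        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd       # improved distance: spawn a new task
                tasks.append((nd, v))

    return dist, examined


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
    "d": [],
}
exact, work_exact = sssp_relaxed(graph, "a", k=1)
relaxed, work_relaxed = sssp_relaxed(graph, "a", k=3)
print(exact == relaxed, work_exact, work_relaxed)
```

Because stale tasks are skipped and improved distances spawn new tasks, the final distances do not depend on the pop order. A looser priority order can only cost extra examined nodes, which is the kind of quantity the abstract says the authors bound.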

From relational verification to SIMD loop synthesis

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442529

G. Barthe, Juan Manuel Crespo, Sumit Gulwani, César Kunz, Mark Marron

Citations: 61

StreamScan: fast scan algorithms for GPUs without global barrier synchronization

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442539

Shengen Yan, Guoping Long, Yunquan Zhang

Abstract: Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. The current state-of-the-art GPU-based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on-chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatically different optimization tradeoffs between Nvidia and AMD GPU platforms.

Citations: 76
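
The difference between the three-phase Reduce-Scan-Scan structure and a single pass in which each block depends only on its predecessor can be sketched sequentially in Python. This is a CPU simulation for illustration only, not the authors' OpenCL implementation. Blocks stand in for workgroups, and the `carry` variable stands in for the value that one workgroup would pass to the next. The function names and the `block` parameter are assumptions.

```python
from itertools import accumulate


def reduce_scan_scan(data, block):
    """Inclusive prefix sum in three phases; reads the input twice, writes once."""
    blocks = [data[i:i + block] for i in range(0, len(data), block)]
    # Phase 1 (reduce): one partial sum per block (first read of the input).
    sums = [sum(b) for b in blocks]
    # Phase 2 (scan): exclusive scan of the block sums gives each block's offset.
    offsets = [0] + list(accumulate(sums))[:-1]
    # Phase 3 (scan): local scan plus offset (second read, then one write).
    out = []
    for b, off in zip(blocks, offsets):
        out.extend(off + x for x in accumulate(b))
    return out


def chained_scan(data, block):
    """Inclusive prefix sum in one pass; reads the input once, writes once."""
    out = []
    carry = 0  # running total handed over by the preceding block only
    for i in range(0, len(data), block):
        local = list(accumulate(data[i:i + block]))   # single read of the block
        out.extend(carry + x for x in local)          # single write of the block
        carry += local[-1]                            # hand over to the next block
    return out


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(reduce_scan_scan(data, 3))   # [3, 4, 8, 9, 14, 23, 25, 31]
print(chained_scan(data, 3))       # [3, 4, 8, 9, 14, 23, 25, 31]
```

In the single-pass version each block needs only its predecessor's running total, which mirrors the abstract's point that synchronization can be limited to adjacent workgroups, and each element is touched twice (one read, one write) instead of three times.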

Fast concurrent queues for x86 processors

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442527

Adam Morrison, Y. Afek

Citations: 138

ZOOMM: a parallel web browser engine for multicore mobile devices

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442543

Calin Cascaval, Seth Fowler, Pablo Montesinos, W. Piekarski, Mehrdad Reshadi, Behnam Robatmili, Michael Weber, Vrajesh Bhavsar

Citations: 34