ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming最新文献

筛选
英文 中文
A decomposition for in-place matrix transposition 就地矩阵转置的分解
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555253
Bryan Catanzaro, A. Keller, M. Garland
{"title":"A decomposition for in-place matrix transposition","authors":"Bryan Catanzaro, A. Keller, M. Garland","doi":"10.1145/2555243.2555253","DOIUrl":"https://doi.org/10.1145/2555243.2555253","url":null,"abstract":"We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s.\u0000 Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses.\u0000 In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130515775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 40
A practical wait-free simulation for lock-free data structures 一个实用的无锁数据结构的无等待模拟
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555261
Shahar Timnat, E. Petrank
{"title":"A practical wait-free simulation for lock-free data structures","authors":"Shahar Timnat, E. Petrank","doi":"10.1145/2555243.2555261","DOIUrl":"https://doi.org/10.1145/2555243.2555261","url":null,"abstract":"Lock-free data structures guarantee overall system progress, whereas wait-free data structures guarantee the progress of each and every thread, providing the desirable non-starvation guarantee for concurrent data structures. While practical lock-free implementations are known for various data structures, wait-free data structure designs are rare. Wait-free implementations have been notoriously hard to design and often inefficient. In this work we present a transformation of lock-free algorithms to wait-free ones allowing even a non-expert to transform a lock-free data-structure into a practical wait-free one. The transformation requires that the lock-free data structure is given in a normalized form defined in this work. Using the new method, we have designed and implemented wait-free linked-list, skiplist, and tree and we measured their performance. It turns out that for all these data structures the wait-free implementations are only a few percent slower than their lock-free counterparts, while still guaranteeing non-starvation.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127095995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 64
Portable, MPI-interoperable coarray fortran 便携式,mpi互操作的队列fortran
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555270
Chaoran Yang, Wesley Bland, J. Mellor-Crummey, P. Balaji
{"title":"Portable, MPI-interoperable coarray fortran","authors":"Chaoran Yang, Wesley Bland, J. Mellor-Crummey, P. Balaji","doi":"10.1145/2555243.2555270","DOIUrl":"https://doi.org/10.1145/2555243.2555270","url":null,"abstract":"The past decade has seen the advent of a number of parallel programming models such as Coarray Fortran (CAF), Unified Parallel C, X10, and Chapel. Despite the productivity gains promised by these models, most parallel scientific applications still rely on MPI as their data movement model. One reason for this trend is that it is hard for users to incrementally adopt these new programming models in existing MPI applications. Because each model use its own runtime system, they duplicate resources and are potentially error-prone. Such independent runtime systems were deemed necessary because MPI was considered insufficient in the past to play this role for these languages.\u0000 The recently released MPI-3, however, adds several new capabilities that now provide all of the functionality needed to act as a runtime, including a much more comprehensive one-sided communication framework. In this paper, we investigate how MPI-3 can form a runtime system for one example programming model, CAF, with a broader goal of enabling a single application to use both MPI and CAF with the highest level of interoperability.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131689935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Heterogeneous computing: what does it mean for compiler research? 异构计算:它对编译器研究意味着什么?
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2692916.2558891
Norman Rubin
{"title":"Heterogeneous computing: what does it mean for compiler research?","authors":"Norman Rubin","doi":"10.1145/2692916.2558891","DOIUrl":"https://doi.org/10.1145/2692916.2558891","url":null,"abstract":"The current trend in computer architecture is to increase the number of cores, to create specialized types of cores within a single machine, and to network such machines together in very fluid web/cloud computing arrangements. Compilers have traditionally focused on optimizations to code that improve performance, but is that the right target to speed up real applications? Consider loading a web page (like starting GMAIL) the page is transferred to the client, any JavaScript is compiled, the JavaScript executes, and the page gets displayed. The classic compiler model (which was first developed in the late 50's) was a great fit for single core machines but has fallen behind architecture, and language. For example how do you compile a single program for a machine that has both a CPU and a graphics coprocessor (a GPU) with a very different programming and memory model? Together with the changes in architecture there have been changes in programming languages. Dynamic languages are used more, static languages are used less. How does this effect compiler research? In this talk, I'll review a number of traditional compiler research challenges that have (or will) become burning issues and will describe some new problems areas that were not considered in the past. For example language specifica-tions are large complex technical documents that are difficult for non-experts to follow. Application programmers are often not willing to read these documents; can a compiler bridge the gap?","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131678561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Theoretical analysis of classic algorithms on highly-threaded many-core GPUs 经典算法在高线程多核gpu上的理论分析
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2014-02-06 DOI: 10.1145/2555243.2555285
Lin Ma, Kunal Agrawal, R. Chamberlain
{"title":"Theoretical analysis of classic algorithms on highly-threaded many-core GPUs","authors":"Lin Ma, Kunal Agrawal, R. Chamberlain","doi":"10.1145/2555243.2555285","DOIUrl":"https://doi.org/10.1145/2555243.2555285","url":null,"abstract":"The Threaded many-core memory (TMM) model provides a framework to analyze the performance of algorithms on GPUs. Here, we investigate the effectiveness of the TMM model by analyzing algorithms for 3 classic problems -- suffix tree/array for string matching, fast Fourier transform, and merge sort -- under this model. Our findings indicate that the TMM model can explain and predict previously unexplained trends and artifacts in experimental data.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126837101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
Data structures for task-based priority scheduling 基于任务优先级调度的数据结构
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-12-09 DOI: 10.1145/2555243.2555278
Martin Wimmer, F. Versaci, J. Träff, Daniel Cederman, P. Tsigas
{"title":"Data structures for task-based priority scheduling","authors":"Martin Wimmer, F. Versaci, J. Träff, Daniel Cederman, P. Tsigas","doi":"10.1145/2555243.2555278","DOIUrl":"https://doi.org/10.1145/2555243.2555278","url":null,"abstract":"We present three lock-free data structures for priority task scheduling: a priority work-stealing one, a centralized one with ρ-relaxed semantics, and a hybrid one combining both concepts. With the single-source shortest path (SSSP) problem as example, we show how the different approaches affect the prioritization and provide upper bounds on the number of examined nodes. We argue that priority task scheduling allows for an intuitive and easy way to parallelize the SSSP problem, notoriously a hard task. Experimental evidence supports the good scalability of the resulting algorithm. The larger aim of this work is to understand the trade-offs between scalability and priority guarantees in task scheduling systems. We show that ρ-relaxation is a valuable technique for improving the first, while still allowing semantic constraints to be satisfied: the lock-free, hybrid $k$-priority data structure can scale as well as work-stealing, while still providing strong priority scheduling guarantees, which depend on the parameter k. Our theoretical results open up possibilities for even more scalable data structures by adopting a weaker form of ρ-relaxation, which still enables the semantic constraints to be respected.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131049656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
From relational verification to SIMD loop synthesis 从关系验证到SIMD回路合成
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442529
G. Barthe, Juan Manuel Crespo, Sumit Gulwani, César Kunz, Mark Marron
{"title":"From relational verification to SIMD loop synthesis","authors":"G. Barthe, Juan Manuel Crespo, Sumit Gulwani, César Kunz, Mark Marron","doi":"10.1145/2442516.2442529","DOIUrl":"https://doi.org/10.1145/2442516.2442529","url":null,"abstract":"Existing pattern-based compiler technology is unable to effectively exploit the full potential of SIMD architectures. We present a new program synthesis based technique for auto-vectorizing performance critical innermost loops. Our synthesis technique is applicable to a wide range of loops, consistently produces performant SIMD code, and generates correctness proofs for the output code. The synthesis technique, which leverages existing work on relational verification methods, is a novel combination of deductive loop restructuring, synthesis condition generation and a new inductive synthesis algorithm for producing loop-free code fragments. The inductive synthesis algorithm wraps an optimized depth-first exploration of code sequences inside a CEGIS loop. Our technique is able to quickly produce SIMD implementations (up to 9 instructions in 0.12 seconds) for a wide range of fundamental looping structures. The resulting SIMD implementations outperform the original loops by 2.0x-3.7x.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116053388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 61
StreamScan: fast scan algorithms for GPUs without global barrier synchronization StreamScan:快速扫描算法的gpu没有全局屏障同步
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442539
Shengen Yan, Guoping Long, Yunquan Zhang
{"title":"StreamScan: fast scan algorithms for GPUs without global barrier synchronization","authors":"Shengen Yan, Guoping Long, Yunquan Zhang","doi":"10.1145/2442516.2442539","DOIUrl":"https://doi.org/10.1145/2442516.2442539","url":null,"abstract":"Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, and thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important op-timizations to further boost performance speedups, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatic different optimization tradeoffs between Nvidia and AMD GPU platforms.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128513272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 76
Fast concurrent queues for x86 processors 用于x86处理器的快速并发队列
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442527
Adam Morrison, Y. Afek
{"title":"Fast concurrent queues for x86 processors","authors":"Adam Morrison, Y. Afek","doi":"10.1145/2442516.2442527","DOIUrl":"https://doi.org/10.1145/2442516.2442527","url":null,"abstract":"Conventional wisdom in designing concurrent data structures is to use the most powerful synchronization primitive, namely compare-and-swap (CAS), and to avoid contended hot spots. In building concurrent FIFO queues, this reasoning has led researchers to propose combining-based concurrent queues.\u0000 This paper takes a different approach, showing how to rely on fetch-and-add (F&A), a less powerful primitive that is available on x86 processors, to construct a nonblocking (lock-free) linearizable concurrent FIFO queue which, despite the F&A being a contended hot spot, outperforms combining-based implementations by 1.5x to 2.5x in all concurrency levels on an x86 server with four multicore processors, in both single-processor and multi-processor executions.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132122722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 138
ZOOMM: a parallel web browser engine for multicore mobile devices ZOOMM:用于多核移动设备的并行web浏览器引擎
ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming Pub Date : 2013-02-23 DOI: 10.1145/2442516.2442543
Calin Cascaval, Seth Fowler, Pablo Montesinos, W. Piekarski, Mehrdad Reshadi, Behnam Robatmili, Michael Weber, Vrajesh Bhavsar
{"title":"ZOOMM: a parallel web browser engine for multicore mobile devices","authors":"Calin Cascaval, Seth Fowler, Pablo Montesinos, W. Piekarski, Mehrdad Reshadi, Behnam Robatmili, Michael Weber, Vrajesh Bhavsar","doi":"10.1145/2442516.2442543","DOIUrl":"https://doi.org/10.1145/2442516.2442543","url":null,"abstract":"We explore the challenges in expressing and managing concurrency in browsers on mobile devices. Browsers are complex applications that implement multiple standards, need to support legacy behavior, and are highly dynamic and interactive. We present ZOOMM, a highly concurrent web browser engine prototype and show how concurrency is effectively exploited at different levels: speed up computation performance, preload network resources, and preprocess resources outside the critical path of page loading. On a dual-core Android mobile device we demonstrate that ZOOMM is two times faster than the native WebKit based browser when loading the set of pages defined in the Vellamo benchmark.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132213843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 34
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信