Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: Latest Publications

VirtCL: a framework for OpenCL device abstraction and management
Yi-Ping You, Hen-Jung Wu, Y. Tsai, Y. Chao
{"title":"VirtCL: a framework for OpenCL device abstraction and management","authors":"Yi-Ping You, Hen-Jung Wu, Y. Tsai, Y. Chao","doi":"10.1145/2688500.2688505","DOIUrl":"https://doi.org/10.1145/2688500.2688505","url":null,"abstract":"The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system for abstracting multiple devices into a single virtual device and for scheduling computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that is able to schedule kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, which was due to the abstraction layer eliminating the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122936796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 38
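The abstract does not spell out the scheduler's cost model, but the contention- and communication-aware idea can be sketched as picking the device that minimizes predicted wait time plus data-transfer time plus a history-based run-time estimate. Everything below (DeviceState, pick_device, the linear cost model) is an invented illustration, not VirtCL's actual code.

```cpp
#include <cstddef>
#include <vector>

struct DeviceState {
    double queued_work_ms = 0;  // contention: predicted time until the device is free
    double resident_bytes = 0;  // bytes of this kernel's buffers already on the device
};

// Choose the device minimizing (wait time + transfer time + predicted run time),
// where the run-time prediction comes from this kernel's execution history.
std::size_t pick_device(const std::vector<DeviceState>& devs,
                        double kernel_bytes,        // total bytes the kernel touches
                        double history_ms,          // mean past run time of this kernel
                        double bw_bytes_per_ms) {   // host<->device bandwidth
    std::size_t best = 0;
    double best_cost = 1e300;
    for (std::size_t d = 0; d < devs.size(); ++d) {
        double transfer = (kernel_bytes - devs[d].resident_bytes) / bw_bytes_per_ms;
        double cost = devs[d].queued_work_ms + transfer + history_ms;
        if (cost < best_cost) { best_cost = cost; best = d; }
    }
    return best;
}
```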
The SprayList: a scalable relaxed priority queue
Dan Alistarh, Justin Kopinsky, Jerry Li, N. Shavit
{"title":"The SprayList: a scalable relaxed priority queue","authors":"Dan Alistarh, Justin Kopinsky, Jerry Li, N. Shavit","doi":"10.1145/2688500.2688523","DOIUrl":"https://doi.org/10.1145/2688500.2688523","url":null,"abstract":"High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately, even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation. In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a non-blocking SkipList, the main innovation behind our design is that the DeleteMin operations avoid a sequential bottleneck by ``spraying'' themselves onto the head of the SkipList list in a coordinated fashion. The spraying is implemented using a carefully designed random walk, so that DeleteMin returns an element among the first O(p log^3 p) in the list, with high probability, where p is the number of threads. We prove that the running time of a DeleteMin operation is O(log^3 p), with high probability, independent of the size of the list. Our experiments show that the relaxed semantics allow the data structure to scale for high thread counts, comparable to a classic unordered SkipList. Furthermore, we observe that, for reasonably parallel workloads, the scalability benefits of relaxation considerably outweigh the additional work due to out-of-order execution.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126239567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 104
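A minimal sketch of the "spray" random walk, using a sorted array as a stand-in for the SkipList: descend from a height of about log p, jumping forward a short random distance at each level, so that p concurrent DeleteMin calls land spread across the head of the list. The jump-length parameters here are simplified; the paper derives the exact constants behind the O(p log^3 p) guarantee.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>

// Walk down levels h..0; at each level jump forward a uniformly random number
// of "steps" scaled by 2^level, mimicking skip-list node spacing at that height.
std::size_t spray_index(std::size_t n, unsigned p, std::mt19937& rng) {
    int h = static_cast<int>(std::log2(std::max(2u, p)));  // starting height ~ log p
    std::size_t idx = 0;
    for (int level = h; level >= 0; --level) {
        std::uniform_int_distribution<std::size_t> jump(0, static_cast<std::size_t>(h));
        idx += jump(rng) << level;   // a step at height `level` skips ~2^level elements
    }
    return std::min(idx, n - 1);     // index of the element this thread will try to take
}

// Usage idea: each thread calls spray_index and attempts to logically delete
// that element, retrying on collision, so contention at the head is spread out.
```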
Low-overhead software transactional memory with progress guarantees and strong semantics
Minjia Zhang, Jipeng Huang, Man Cao, Michael D. Bond
{"title":"Low-overhead software transactional memory with progress guarantees and strong semantics","authors":"Minjia Zhang, Jipeng Huang, Man Cao, Michael D. Bond","doi":"10.1145/2688500.2688510","DOIUrl":"https://doi.org/10.1145/2688500.2688510","url":null,"abstract":"Software transactional memory offers an appealing alternative to locks by improving programmability, reliability, and scalability. However, existing STMs are impractical because they add high instrumentation costs and often provide weak progress guarantees and/or semantics. This paper introduces a novel STM called LarkTM that provides three significant features. (1) Its instrumentation adds low overhead except when accesses actually conflict, enabling low single-thread overhead and scaling well on low-contention workloads. (2) It uses eager concurrency control mechanisms, yet naturally supports flexible conflict resolution, enabling strong progress guarantees. (3) It naturally provides strong atomicity semantics at low cost. LarkTM's design works well for low-contention workloads, but adds significant overhead under higher contention, so we design an adaptive version of LarkTM that uses alternative concurrency control for high-contention objects. An implementation and evaluation in a Java virtual machine show that the basic and adaptive versions of LarkTM not only provide low single-thread overhead, but their multithreaded performance compares favorably with existing high-performance STMs.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130252280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
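As a rough illustration of eager concurrency control, the sketch below acquires per-object write ownership at access time, so the common case (re-accessing an object the transaction already owns) is a plain load with no atomic read-modify-write. This is an invented simplification; LarkTM's biased synchronization and conflict-resolution machinery are substantially more involved.

```cpp
#include <atomic>

struct TxObject {
    std::atomic<int> owner{-1};   // transaction id owning this object, -1 if free
};

// Eagerly acquire write ownership; returns false if another transaction owns
// the object, in which case a contention manager would decide who aborts.
bool acquire_eager(TxObject& obj, int tx_id) {
    int cur = obj.owner.load(std::memory_order_acquire);
    if (cur == tx_id) return true;    // fast path: already owned, no CAS needed
    if (cur != -1) return false;      // conflict: defer to the contention manager
    int expected = -1;
    return obj.owner.compare_exchange_strong(expected, tx_id,
                                             std::memory_order_acq_rel);
}
```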
Predicate RCU: an RCU for scalable concurrent updates
M. Arbel, Adam Morrison
{"title":"Predicate RCU: an RCU for scalable concurrent updates","authors":"M. Arbel, Adam Morrison","doi":"10.1145/2688500.2688518","DOIUrl":"https://doi.org/10.1145/2688500.2688518","url":null,"abstract":"Read-copy update (RCU) is a shared memory synchronization mechanism with scalable synchronization-free reads that nevertheless execute correctly with concurrent updates. To guarantee the consistency of such reads, an RCU update transitioning the data structure between certain states must wait for the completion of all existing reads. Unfortunately, these waiting periods quickly become a bottleneck, and thus RCU remains unused in data structures that require scalable, fine-grained, update operations. To solve this problem, we present Predicate RCU (PRCU), an RCU variant in which an update waits only for the reads whose consistency it affects, which are specified by a user-supplied predicate. We explore the trade-offs in implementing PRCU, describing implementations that reduce wait times by 10--100x with varying overhead on reads on modern x86 multiprocessor machines. We demonstrate the applicability of PRCU by applying it to two RCU-based concurrent algorithms---the Citrus binary search tree and a resizable hash table---and show experimentally that PRCU significantly improves the performance of both algorithms.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132755937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
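A minimal sketch of the predicate idea, assuming a fixed thread count and nonzero keys: each reader publishes the key it is reading, and an updater spins only on readers whose published key satisfies its predicate. Real PRCU implementations do more careful grace-period accounting than this busy-wait.

```cpp
#include <atomic>
#include <functional>

// Slot value 0 (IDLE) means "not in a read-side critical section";
// readers are assumed to use nonzero keys.
constexpr int MAX_THREADS = 64;
constexpr long IDLE = 0;

std::atomic<long> reading_key[MAX_THREADS];  // static storage: zero-initialized, all IDLE

void read_lock(int tid, long key) { reading_key[tid].store(key, std::memory_order_seq_cst); }
void read_unlock(int tid)         { reading_key[tid].store(IDLE, std::memory_order_release); }

// Update-side wait: spin only on readers whose published key the update
// actually affects, per the user-supplied predicate.
void wait_for_affected_readers(const std::function<bool(long)>& affected) {
    for (int t = 0; t < MAX_THREADS; ++t) {
        long k = reading_key[t].load(std::memory_order_acquire);
        while (k != IDLE && affected(k))
            k = reading_key[t].load(std::memory_order_acquire);
    }
}
```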
A library for portable and composable data locality optimizations for NUMA systems
Z. Majó, T. Gross
{"title":"A library for portable and composable data locality optimizations for NUMA systems","authors":"Z. Majó, T. Gross","doi":"10.1145/2688500.2688509","DOIUrl":"https://doi.org/10.1145/2688500.2688509","url":null,"abstract":"Many recent multiprocessor systems are realized with a non-uniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not~portable, and (3) optimizations are not~composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software). This paper presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB, but more important, optimized programs are portable across different NUMA architectures and preserve data locality also when composed with other parallel computations.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133127589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
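TBB-NUMA's affinity model is baked into the library's own layers, so the sketch below only illustrates the underlying idea with plain threads: partition the data per NUMA node and run each partition on a worker intended for that node, so repeated parallel phases reuse node-local memory. Partition, partition_by_node, and parallel_for_numa are invented names, not TBB-NUMA's API.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

struct Partition { std::size_t begin, end; int numa_node; };

// Split [0, n) into one contiguous partition per NUMA node.
std::vector<Partition> partition_by_node(std::size_t n, int nodes) {
    std::vector<Partition> parts;
    std::size_t chunk = n / nodes;
    for (int node = 0; node < nodes; ++node) {
        std::size_t b = node * chunk;
        std::size_t e = (node == nodes - 1) ? n : b + chunk;
        parts.push_back({b, e, node});
    }
    return parts;
}

template <typename F>
void parallel_for_numa(const std::vector<Partition>& parts, F body) {
    std::vector<std::thread> workers;
    for (const auto& p : parts)
        // A real implementation would pin each worker to p.numa_node
        // (e.g., via pthread_setaffinity_np on Linux) before running the body.
        workers.emplace_back([&, p] { for (auto i = p.begin; i < p.end; ++i) body(i); });
    for (auto& w : workers) w.join();
}
```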
High performance locks for multi-level NUMA systems
Milind Chabbi, M. Fagan, J. Mellor-Crummey
{"title":"High performance locks for multi-level NUMA systems","authors":"Milind Chabbi, M. Fagan, J. Mellor-Crummey","doi":"10.1145/2688500.2688503","DOIUrl":"https://doi.org/10.1145/2688500.2688503","url":null,"abstract":"Efficient locking mechanisms are critically important for high performance computers. On highly-threaded systems with a deep memory hierarchy, the throughput of traditional queueing locks, e.g., MCS locks, falls off due to NUMA effects. Two-level cohort locks perform better on NUMA systems, but fail to deliver top performance for deep NUMA hierarchies. In this paper, we describe a hierarchical variant of the MCS lock that adapts the principles of cohort locking for architectures with deep NUMA hierarchies. We describe analytical models for throughput and fairness of Cohort-MCS (C-MCS) and Hierarchical MCS (HMCS) locks that enable us to tailor these locks for high performance on any target platform without empirical tuning. Using these models, one can select parameters such that an HMCS lock will deliver better fairness than a C-MCS lock for a given throughput, or deliver better throughput for a given fairness. Our experiments show that, under high contention, a three-level HMCS lock delivers up to 7.6x higher lock throughput than a C-MCS lock on a 128-thread IBM Power 755 and a five-level HMCS lock delivers up to 72x higher lock throughput on a 4096-thread SGI UV 1000. On the K-means clustering code from the MineBench suit, a three-level HMCS lock reduces the running time by up to 55% compared to the C-MCS lock on a IBM Power 755.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122179871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 55
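For reference, here is a compact version of the classic MCS queue lock that C-MCS and HMCS build on: each thread spins only on a flag in its own queue node, which keeps coherence traffic local. Roughly speaking, an HMCS lock composes one such queue per level of the NUMA hierarchy and passes the lock within a level up to a threshold before escalating; that machinery is not shown here.

```cpp
#include <atomic>

// Arriving threads append a node with an atomic exchange, then spin on their
// own node's flag (local spinning, no global hot spot).
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail{nullptr};
public:
    void acquire(MCSNode* me) {
        me->next.store(nullptr, std::memory_order_relaxed);
        me->locked.store(true, std::memory_order_relaxed);
        MCSNode* pred = tail.exchange(me, std::memory_order_acq_rel);
        if (pred) {                                   // queue non-empty: link in and wait
            pred->next.store(me, std::memory_order_release);
            while (me->locked.load(std::memory_order_acquire)) { /* spin locally */ }
        }
    }
    void release(MCSNode* me) {
        MCSNode* succ = me->next.load(std::memory_order_acquire);
        if (!succ) {
            MCSNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;                               // no waiter: lock is now free
            while (!(succ = me->next.load(std::memory_order_acquire))) { /* await link */ }
        }
        succ->locked.store(false, std::memory_order_release);  // hand off to successor
    }
};
```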
Tiles: a new language mechanism for heterogeneous parallelism
Yifeng Chen, Xiang Cui, Hong Mei
{"title":"Tiles: a new language mechanism for heterogeneous parallelism","authors":"Yifeng Chen, Xiang Cui, Hong Mei","doi":"10.1145/2688500.2688555","DOIUrl":"https://doi.org/10.1145/2688500.2688555","url":null,"abstract":"This paper studies the essence of heterogeneity from the perspective of language mechanism design. The proposed mechanism, called tiles, is a program construct that bridges two relative levels of computation: an outer level of source data in larger, slower or more distributed memory and an inner level of data blocks in smaller, faster or more localized memory.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127893646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
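Tiles are proposed as a language construct, so as a point of comparison the plain C++ below expresses the same two-level pattern by hand: an outer loop over blocks of source data in large memory, and an inner loop over one block staged into smaller, faster storage (a stack buffer standing in for a scratchpad or cache tile).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 256;

void scale_tiled(std::vector<double>& data, double factor) {
    double block[TILE];                                // inner-level storage
    for (std::size_t base = 0; base < data.size(); base += TILE) {
        std::size_t len = std::min(TILE, data.size() - base);
        std::copy_n(&data[base], len, block);          // stage the block in
        for (std::size_t i = 0; i < len; ++i)          // compute on the fast copy
            block[i] *= factor;
        std::copy_n(block, len, &data[base]);          // write the block back
    }
}
```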
JAWS: a JavaScript framework for adaptive CPU-GPU work sharing
Xianglan Piao, Channoh Kim, Young H. Oh, Huiying Li, Jin-Chul Kim, Hanjun Kim, Jae W. Lee
{"title":"JAWS: a JavaScript framework for adaptive CPU-GPU work sharing","authors":"Xianglan Piao, Channoh Kim, Young H. Oh, Huiying Li, Jin-Chul Kim, Hanjun Kim, Jae W. Lee","doi":"10.1145/2688500.2688525","DOIUrl":"https://doi.org/10.1145/2688500.2688525","url":null,"abstract":"This paper introduces jAWS, a JavaScript framework for adaptive work sharing between CPU and GPU for data-parallel workloads. Unlike conventional heterogeneous parallel programming environments for JavaScript, which use only one compute device when executing a single kernel, jAWS accelerates kernel execution by exploiting both devices to realize full performance potential of heterogeneous multicores. jAWS employs an efficient work partitioning algorithm that finds an optimal work distribution between the two devices without requiring offline profiling. The jAWS runtime provides shared arrays for multiple parallel contexts, hence eliminating extra copy overhead for input and output data. Our preliminary evaluation with both CPU-friendly and GPU-friendly benchmarks demonstrates that jAWS provides good load balancing and efficient data communication between parallel contexts, to significantly outperform best single-device execution.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"20 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114112995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
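The abstract does not give jAWS's partitioning algorithm, but profiling-free adaptive sharing is commonly done by measuring each device's throughput on completed chunks and steering the split toward the throughput-proportional ratio; the sketch below illustrates that feedback loop. next_split and its damping factor are invented for illustration, and jAWS's shared-array machinery is not shown.

```cpp
#include <cstddef>

struct Split { std::size_t cpu_items, gpu_items; };

// gpu_ratio = fraction of work given to the GPU, refined after every chunk
// from the throughput observed on the previous chunk (no offline profiling).
Split next_split(std::size_t chunk, double& gpu_ratio,
                 double cpu_items_per_ms, double gpu_items_per_ms) {
    // Move toward the throughput-proportional split, damped to smooth noise.
    double target = gpu_items_per_ms / (cpu_items_per_ms + gpu_items_per_ms);
    gpu_ratio = 0.5 * gpu_ratio + 0.5 * target;
    auto gpu = static_cast<std::size_t>(chunk * gpu_ratio);
    return {chunk - gpu, gpu};
}
```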
PLUTO+: near-complete modeling of affine transformations for parallelism and locality
Aravind Acharya, Uday Bondhugula
{"title":"PLUTO+: near-complete modeling of affine transformations for parallelism and locality","authors":"Aravind Acharya, Uday Bondhugula","doi":"10.1145/2688500.2688512","DOIUrl":"https://doi.org/10.1145/2688500.2688512","url":null,"abstract":"Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks like the Pluto algorithm, that include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in finding the transformations. The ensuing practical trade-offs lead to the exclusion of certain useful transformations, in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach to address this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We perform an experimental evaluation of both, the effect on compilation time, and performance of generated codes. The evaluation shows that our new framework, Pluto+, provides no degradation in performance in any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time only by 15%. In cases where it improves execution time significantly, it increased polyhedral optimization time only by 2.04x.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122934769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
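To make "a single multi-dimensional affine function represents a sequence of simpler transformations" concrete, here is a standard skewing example (not taken from the paper): the affine map (i, j) -> (i + j, j) turns a doubly dependent nest into a wavefront whose inner loop is dependence-free. Pluto finds such functions automatically; PLUTO+ enlarges the search space to compositions that also include loop reversals and negative skewing factors.

```cpp
#include <algorithm>
#include <vector>

// Original nest: A[i][j] depends on A[i-1][j] and A[i][j-1] (dependence
// vectors (1,0) and (0,1)), so neither loop is parallel as written.
void sweep_orig(std::vector<std::vector<double>>& A, int N) {
    for (int i = 1; i < N; ++i)
        for (int j = 1; j < N; ++j)
            A[i][j] = A[i - 1][j] + A[i][j - 1];
}

// After the affine transformation (i, j) -> (t1, t2) = (i + j, j), the
// dependences become (1,0) and (1,1): both are carried by the outer t1 loop,
// so for a fixed t1 every t2 iteration is independent (a wavefront) and the
// inner loop could be marked parallel, e.g. with OpenMP.
void sweep_skewed(std::vector<std::vector<double>>& A, int N) {
    for (int t1 = 2; t1 <= 2 * N - 2; ++t1)
        for (int t2 = std::max(1, t1 - N + 1); t2 <= std::min(t1 - 1, N - 1); ++t2) {
            int i = t1 - t2, j = t2;
            A[i][j] = A[i - 1][j] + A[i][j - 1];
        }
}
```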
Scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly
Loïc Thébault, E. Petit, Quang Dinh
{"title":"Scalable and efficient implementation of 3d unstructured meshes computation: a case study on matrix assembly","authors":"Loïc Thébault, E. Petit, Quang Dinh","doi":"10.1145/2688500.2688517","DOIUrl":"https://doi.org/10.1145/2688500.2688517","url":null,"abstract":"Exposing massive parallelism on 3D unstructured meshes computation with efficient load balancing and minimal synchronizations is challenging. Current approaches relying on domain decomposition and mesh coloring struggle to scale with the increasing number of cores per nodes, especially with new many-core processors. In this paper, we propose an hybrid approach using domain decomposition to exploit distributed memory parallelism, Divide-and-Conquer, D&C, to exploit shared memory parallelism and improve locality, and mesh coloring at core level to exploit vectors. It illustrates a new trade-off for many-cores between structuredness, memory locality, and vectorization. We evaluate our approach on the finite element matrix assembly of an industrial fluid dynamic code developed by Dassault Aviation. We compare our D&C approach to domain decomposition and to mesh coloring. D&C achieves a high parallel efficiency, a good data locality as well as an improved bandwidth usage. It competes on current nodes with the optimized pure MPI version with a minimum 10% speed-up. D&C shows an impressive 319x strong scaling on 512 cores (32 nodes) with only 2000 vertices per core. Finally, the Intel Xeon Phi version has a performance similar to 10 Intel E5-2665 Xeon Sandy Bridge cores and 95% parallel efficiency on the 60 physical cores. Running on 4 Xeon Phi (240 cores), D&C has 92% efficiency on the physical cores and performance similar to 33 Intel E5-2665 Xeon Sandy Bridge cores.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124545225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
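A schematic of divide-and-conquer assembly, assuming leaf_size >= 1 and a partitioner that separates elements straddling the cut: the two halves are assembled in parallel and the separator elements afterwards, so concurrent tasks never update the same matrix rows. assemble and bisect are trivial placeholders here; the real code's recursive bisection, leaf-level coloring, and vectorization are not shown.

```cpp
#include <cstddef>
#include <future>
#include <vector>

struct Element { int id; /* connectivity, coefficients, ... */ };

// Placeholder leaf kernel: assembles a set of elements serially.
void assemble(const std::vector<Element>&) { /* add element contributions */ }

// Placeholder partitioner: a real one computes a geometric or graph bisection
// and puts the elements straddling the cut into `separator`.
void bisect(const std::vector<Element>& elems,
            std::vector<Element>& left, std::vector<Element>& right,
            std::vector<Element>& separator) {
    std::size_t mid = elems.size() / 2;
    left.assign(elems.begin(), elems.begin() + mid);
    right.assign(elems.begin() + mid, elems.end());
    separator.clear();  // none in this trivial split
}

void assemble_dc(const std::vector<Element>& elems, std::size_t leaf_size) {
    if (elems.size() <= leaf_size) { assemble(elems); return; }
    std::vector<Element> left, right, sep;
    bisect(elems, left, right, sep);
    auto half = std::async(std::launch::async,
                           [&] { assemble_dc(left, leaf_size); });
    assemble_dc(right, leaf_size);   // the other half runs concurrently
    half.get();
    assemble_dc(sep, leaf_size);     // separator last: touches rows of both halves
}
```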