Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: Latest Publications

VirtCL: a framework for OpenCL device abstraction and management
Yi-Ping You, Hen-Jung Wu, Y. Tsai, Y. Chao
{"title":"VirtCL: a framework for OpenCL device abstraction and management","authors":"Yi-Ping You, Hen-Jung Wu, Y. Tsai, Y. Chao","doi":"10.1145/2688500.2688505","DOIUrl":"https://doi.org/10.1145/2688500.2688505","url":null,"abstract":"The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system for abstracting multiple devices into a single virtual device and for scheduling computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that is able to schedule kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, which was due to the abstraction layer eliminating the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122936796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 38
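The abstract does not spell out the scheduler's cost model, but the contention- and communication-aware idea can be sketched as picking the device that minimizes predicted wait time plus data-transfer time plus a history-based run-time estimate. Everything below (DeviceState, pick_device, the linear cost model) is an invented illustration, not VirtCL's actual code.

```cpp
#include <cstddef>
#include <vector>

struct DeviceState {
    double queued_work_ms = 0;  // contention: predicted time until the device is free
    double resident_bytes = 0;  // bytes of this kernel's buffers already on the device
};

// Choose the device minimizing (wait time + transfer time + predicted run time),
// where the run-time prediction comes from this kernel's execution history.
std::size_t pick_device(const std::vector<DeviceState>& devs,
                        double kernel_bytes,        // total bytes the kernel touches
                        double history_ms,          // mean past run time of this kernel
                        double bw_bytes_per_ms) {   // host<->device bandwidth
    std::size_t best = 0;
    double best_cost = 1e300;
    for (std::size_t d = 0; d < devs.size(); ++d) {
        double transfer = (kernel_bytes - devs[d].resident_bytes) / bw_bytes_per_ms;
        double cost = devs[d].queued_work_ms + transfer + history_ms;
        if (cost < best_cost) { best_cost = cost; best = d; }
    }
    return best;
}
```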
The SprayList: a scalable relaxed priority queue
Dan Alistarh, Justin Kopinsky, Jerry Li, N. Shavit
{"title":"The SprayList: a scalable relaxed priority queue","authors":"Dan Alistarh, Justin Kopinsky, Jerry Li, N. Shavit","doi":"10.1145/2688500.2688523","DOIUrl":"https://doi.org/10.1145/2688500.2688523","url":null,"abstract":"High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately, even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation. In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a non-blocking SkipList, the main innovation behind our design is that the DeleteMin operations avoid a sequential bottleneck by ``spraying'' themselves onto the head of the SkipList list in a coordinated fashion. The spraying is implemented using a carefully designed random walk, so that DeleteMin returns an element among the first O(p log^3 p) in the list, with high probability, where p is the number of threads. We prove that the running time of a DeleteMin operation is O(log^3 p), with high probability, independent of the size of the list. Our experiments show that the relaxed semantics allow the data structure to scale for high thread counts, comparable to a classic unordered SkipList. Furthermore, we observe that, for reasonably parallel workloads, the scalability benefits of relaxation considerably outweigh the additional work due to out-of-order execution.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126239567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 104
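A minimal sketch of the "spray" random walk, using a sorted array as a stand-in for the SkipList: descend from a height of about log p, jumping forward a short random distance at each level, so that p concurrent DeleteMin calls land spread across the head of the list. The jump-length parameters here are simplified; the paper derives the exact constants behind the O(p log^3 p) guarantee.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>

// Walk down levels h..0; at each level jump forward a uniformly random number
// of "steps" scaled by 2^level, mimicking skip-list node spacing at that height.
std::size_t spray_index(std::size_t n, unsigned p, std::mt19937& rng) {
    int h = static_cast<int>(std::log2(std::max(2u, p)));  // starting height ~ log p
    std::size_t idx = 0;
    for (int level = h; level >= 0; --level) {
        std::uniform_int_distribution<std::size_t> jump(0, static_cast<std::size_t>(h));
        idx += jump(rng) << level;   // a step at height `level` skips ~2^level elements
    }
    return std::min(idx, n - 1);     // index of the element this thread will try to take
}

// Usage idea: each thread calls spray_index and attempts to logically delete
// that element, retrying on collision, so contention at the head is spread out.
```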
Low-overhead software transactional memory with progress guarantees and strong semantics
Minjia Zhang, Jipeng Huang, Man Cao, Michael D. Bond
{"title":"Low-overhead software transactional memory with progress guarantees and strong semantics","authors":"Minjia Zhang, Jipeng Huang, Man Cao, Michael D. Bond","doi":"10.1145/2688500.2688510","DOIUrl":"https://doi.org/10.1145/2688500.2688510","url":null,"abstract":"Software transactional memory offers an appealing alternative to locks by improving programmability, reliability, and scalability. However, existing STMs are impractical because they add high instrumentation costs and often provide weak progress guarantees and/or semantics. This paper introduces a novel STM called LarkTM that provides three significant features. (1) Its instrumentation adds low overhead except when accesses actually conflict, enabling low single-thread overhead and scaling well on low-contention workloads. (2) It uses eager concurrency control mechanisms, yet naturally supports flexible conflict resolution, enabling strong progress guarantees. (3) It naturally provides strong atomicity semantics at low cost. LarkTM's design works well for low-contention workloads, but adds significant overhead under higher contention, so we design an adaptive version of LarkTM that uses alternative concurrency control for high-contention objects. An implementation and evaluation in a Java virtual machine show that the basic and adaptive versions of LarkTM not only provide low single-thread overhead, but their multithreaded performance compares favorably with existing high-performance STMs.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130252280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
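As a rough illustration of eager concurrency control, the sketch below acquires per-object write ownership at access time, so the common case (re-accessing an object the transaction already owns) is a plain load with no atomic read-modify-write. This is an invented simplification; LarkTM's biased synchronization and conflict-resolution machinery are substantially more involved.

```cpp
#include <atomic>

struct TxObject {
    std::atomic<int> owner{-1};   // transaction id owning this object, -1 if free
};

// Eagerly acquire write ownership; returns false if another transaction owns
// the object, in which case a contention manager would decide who aborts.
bool acquire_eager(TxObject& obj, int tx_id) {
    int cur = obj.owner.load(std::memory_order_acquire);
    if (cur == tx_id) return true;    // fast path: already owned, no CAS needed
    if (cur != -1) return false;      // conflict: defer to the contention manager
    int expected = -1;
    return obj.owner.compare_exchange_strong(expected, tx_id,
                                             std::memory_order_acq_rel);
}
```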
Predicate RCU: an RCU for scalable concurrent updates
M. Arbel, Adam Morrison
{"title":"Predicate RCU: an RCU for scalable concurrent updates","authors":"M. Arbel, Adam Morrison","doi":"10.1145/2688500.2688518","DOIUrl":"https://doi.org/10.1145/2688500.2688518","url":null,"abstract":"Read-copy update (RCU) is a shared memory synchronization mechanism with scalable synchronization-free reads that nevertheless execute correctly with concurrent updates. To guarantee the consistency of such reads, an RCU update transitioning the data structure between certain states must wait for the completion of all existing reads. Unfortunately, these waiting periods quickly become a bottleneck, and thus RCU remains unused in data structures that require scalable, fine-grained, update operations. To solve this problem, we present Predicate RCU (PRCU), an RCU variant in which an update waits only for the reads whose consistency it affects, which are specified by a user-supplied predicate. We explore the trade-offs in implementing PRCU, describing implementations that reduce wait times by 10--100x with varying overhead on reads on modern x86 multiprocessor machines. We demonstrate the applicability of PRCU by applying it to two RCU-based concurrent algorithms---the Citrus binary search tree and a resizable hash table---and show experimentally that PRCU significantly improves the performance of both algorithms.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132755937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
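A minimal sketch of the predicate idea, assuming a fixed thread count and nonzero keys: each reader publishes the key it is reading, and an updater spins only on readers whose published key satisfies its predicate. Real PRCU implementations do more careful grace-period accounting than this busy-wait.

```cpp
#include <atomic>
#include <functional>

// Slot value 0 (IDLE) means "not in a read-side critical section";
// readers are assumed to use nonzero keys.
constexpr int MAX_THREADS = 64;
constexpr long IDLE = 0;

std::atomic<long> reading_key[MAX_THREADS];  // static storage: zero-initialized, all IDLE

void read_lock(int tid, long key) { reading_key[tid].store(key, std::memory_order_seq_cst); }
void read_unlock(int tid)         { reading_key[tid].store(IDLE, std::memory_order_release); }

// Update-side wait: spin only on readers whose published key the update
// actually affects, per the user-supplied predicate.
void wait_for_affected_readers(const std::function<bool(long)>& affected) {
    for (int t = 0; t < MAX_THREADS; ++t) {
        long k = reading_key[t].load(std::memory_order_acquire);
        while (k != IDLE && affected(k))
            k = reading_key[t].load(std::memory_order_acquire);
    }
}
```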
A library for portable and composable data locality optimizations for NUMA systems
Z. Majó, T. Gross
{"title":"A library for portable and composable data locality optimizations for NUMA systems","authors":"Z. Majó, T. Gross","doi":"10.1145/2688500.2688509","DOIUrl":"https://doi.org/10.1145/2688500.2688509","url":null,"abstract":"Many recent multiprocessor systems are realized with a non-uniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not~portable, and (3) optimizations are not~composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software). This paper presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB, but more important, optimized programs are portable across different NUMA architectures and preserve data locality also when composed with other parallel computations.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133127589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
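TBB-NUMA's affinity model is baked into the library's own layers, so the sketch below only illustrates the underlying idea with plain threads: partition the data per NUMA node and run each partition on a worker intended for that node, so repeated parallel phases reuse node-local memory. Partition, partition_by_node, and parallel_for_numa are invented names, not TBB-NUMA's API.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

struct Partition { std::size_t begin, end; int numa_node; };

// Split [0, n) into one contiguous partition per NUMA node.
std::vector<Partition> partition_by_node(std::size_t n, int nodes) {
    std::vector<Partition> parts;
    std::size_t chunk = n / nodes;
    for (int node = 0; node < nodes; ++node) {
        std::size_t b = node * chunk;
        std::size_t e = (node == nodes - 1) ? n : b + chunk;
        parts.push_back({b, e, node});
    }
    return parts;
}

template <typename F>
void parallel_for_numa(const std::vector<Partition>& parts, F body) {
    std::vector<std::thread> workers;
    for (const auto& p : parts)
        // A real implementation would pin each worker to p.numa_node
        // (e.g., via pthread_setaffinity_np on Linux) before running the body.
        workers.emplace_back([&, p] { for (auto i = p.begin; i < p.end; ++i) body(i); });
    for (auto& w : workers) w.join();
}
```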
High performance locks for multi-level NUMA systems
Milind Chabbi, M. Fagan, J. Mellor-Crummey
{"title":"High performance locks for multi-level NUMA systems","authors":"Milind Chabbi, M. Fagan, J. Mellor-Crummey","doi":"10.1145/2688500.2688503","DOIUrl":"https://doi.org/10.1145/2688500.2688503","url":null,"abstract":"Efficient locking mechanisms are critically important for high performance computers. On highly-threaded systems with a deep memory hierarchy, the throughput of traditional queueing locks, e.g., MCS locks, falls off due to NUMA effects. Two-level cohort locks perform better on NUMA systems, but fail to deliver top performance for deep NUMA hierarchies. In this paper, we describe a hierarchical variant of the MCS lock that adapts the principles of cohort locking for architectures with deep NUMA hierarchies. We describe analytical models for throughput and fairness of Cohort-MCS (C-MCS) and Hierarchical MCS (HMCS) locks that enable us to tailor these locks for high performance on any target platform without empirical tuning. Using these models, one can select parameters such that an HMCS lock will deliver better fairness than a C-MCS lock for a given throughput, or deliver better throughput for a given fairness. Our experiments show that, under high contention, a three-level HMCS lock delivers up to 7.6x higher lock throughput than a C-MCS lock on a 128-thread IBM Power 755 and a five-level HMCS lock delivers up to 72x higher lock throughput on a 4096-thread SGI UV 1000. On the K-means clustering code from the MineBench suit, a three-level HMCS lock reduces the running time by up to 55% compared to the C-MCS lock on a IBM Power 755.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122179871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 55
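For reference, here is a compact version of the classic MCS queue lock that C-MCS and HMCS build on: each thread spins only on a flag in its own queue node, which keeps coherence traffic local. Roughly speaking, an HMCS lock composes one such queue per level of the NUMA hierarchy and passes the lock within a level up to a threshold before escalating; that machinery is not shown here.

```cpp
#include <atomic>

// Arriving threads append a node with an atomic exchange, then spin on their
// own node's flag (local spinning, no global hot spot).
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail{nullptr};
public:
    void acquire(MCSNode* me) {
        me->next.store(nullptr, std::memory_order_relaxed);
        me->locked.store(true, std::memory_order_relaxed);
        MCSNode* pred = tail.exchange(me, std::memory_order_acq_rel);
        if (pred) {                                   // queue non-empty: link in and wait
            pred->next.store(me, std::memory_order_release);
            while (me->locked.load(std::memory_order_acquire)) { /* spin locally */ }
        }
    }
    void release(MCSNode* me) {
        MCSNode* succ = me->next.load(std::memory_order_acquire);
        if (!succ) {
            MCSNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;                               // no waiter: lock is now free
            while (!(succ = me->next.load(std::memory_order_acquire))) { /* await link */ }
        }
        succ->locked.store(false, std::memory_order_release);  // hand off to successor
    }
};
```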
Tiles: a new language mechanism for heterogeneous parallelism
Yifeng Chen, Xiang Cui, Hong Mei
{"title":"Tiles: a new language mechanism for heterogeneous parallelism","authors":"Yifeng Chen, Xiang Cui, Hong Mei","doi":"10.1145/2688500.2688555","DOIUrl":"https://doi.org/10.1145/2688500.2688555","url":null,"abstract":"This paper studies the essence of heterogeneity from the perspective of language mechanism design. The proposed mechanism, called tiles, is a program construct that bridges two relative levels of computation: an outer level of source data in larger, slower or more distributed memory and an inner level of data blocks in smaller, faster or more localized memory.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127893646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
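Tiles are proposed as a language construct, so as a point of comparison the plain C++ below expresses the same two-level pattern by hand: an outer loop over blocks of source data in large memory, and an inner loop over one block staged into smaller, faster storage (a stack buffer standing in for a scratchpad or cache tile).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 256;

void scale_tiled(std::vector<double>& data, double factor) {
    double block[TILE];                                // inner-level storage
    for (std::size_t base = 0; base < data.size(); base += TILE) {
        std::size_t len = std::min(TILE, data.size() - base);
        std::copy_n(&data[base], len, block);          // stage the block in
        for (std::size_t i = 0; i < len; ++i)          // compute on the fast copy
            block[i] *= factor;
        std::copy_n(block, len, &data[base]);          // write the block back
    }
}
```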
JAWS: a JavaScript framework for adaptive CPU-GPU work sharing
Xianglan Piao, Channoh Kim, Young H. Oh, Huiying Li, Jin-Chul Kim, Hanjun Kim, Jae W. Lee
{"title":"JAWS: a JavaScript framework for adaptive CPU-GPU work sharing","authors":"Xianglan Piao, Channoh Kim, Young H. Oh, Huiying Li, Jin-Chul Kim, Hanjun Kim, Jae W. Lee","doi":"10.1145/2688500.2688525","DOIUrl":"https://doi.org/10.1145/2688500.2688525","url":null,"abstract":"This paper introduces jAWS, a JavaScript framework for adaptive work sharing between CPU and GPU for data-parallel workloads. Unlike conventional heterogeneous parallel programming environments for JavaScript, which use only one compute device when executing a single kernel, jAWS accelerates kernel execution by exploiting both devices to realize full performance potential of heterogeneous multicores. jAWS employs an efficient work partitioning algorithm that finds an optimal work distribution between the two devices without requiring offline profiling. The jAWS runtime provides shared arrays for multiple parallel contexts, hence eliminating extra copy overhead for input and output data. Our preliminary evaluation with both CPU-friendly and GPU-friendly benchmarks demonstrates that jAWS provides good load balancing and efficient data communication between parallel contexts, to significantly outperform best single-device execution.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"20 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114112995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
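The abstract does not give jAWS's partitioning algorithm, but profiling-free adaptive sharing is commonly done by measuring each device's throughput on completed chunks and steering the split toward the throughput-proportional ratio; the sketch below illustrates that feedback loop. next_split and its damping factor are invented for illustration, and jAWS's shared-array machinery is not shown.

```cpp
#include <cstddef>

struct Split { std::size_t cpu_items, gpu_items; };

// gpu_ratio = fraction of work given to the GPU, refined after every chunk
// from the throughput observed on the previous chunk (no offline profiling).
Split next_split(std::size_t chunk, double& gpu_ratio,
                 double cpu_items_per_ms, double gpu_items_per_ms) {
    // Move toward the throughput-proportional split, damped to smooth noise.
    double target = gpu_items_per_ms / (cpu_items_per_ms + gpu_items_per_ms);
    gpu_ratio = 0.5 * gpu_ratio + 0.5 * target;
    auto gpu = static_cast<std::size_t>(chunk * gpu_ratio);
    return {chunk - gpu, gpu};
}
```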
PLUTO+: near-complete modeling of affine transformations for parallelism and locality
Aravind Acharya, Uday Bondhugula
{"title":"PLUTO+: near-complete modeling of affine transformations for parallelism and locality","authors":"Aravind Acharya, Uday Bondhugula","doi":"10.1145/2688500.2688512","DOIUrl":"https://doi.org/10.1145/2688500.2688512","url":null,"abstract":"Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks like the Pluto algorithm, that include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in finding the transformations. The ensuing practical trade-offs lead to the exclusion of certain useful transformations, in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach to address this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We perform an experimental evaluation of both, the effect on compilation time, and performance of generated codes. The evaluation shows that our new framework, Pluto+, provides no degradation in performance in any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time only by 15%. In cases where it improves execution time significantly, it increased polyhedral optimization time only by 2.04x.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122934769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
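To make "a single multi-dimensional affine function represents a sequence of simpler transformations" concrete, here is a standard skewing example (not taken from the paper): the affine map (i, j) -> (i + j, j) turns a doubly dependent nest into a wavefront whose inner loop is dependence-free. Pluto finds such functions automatically; PLUTO+ enlarges the search space to compositions that also include loop reversals and negative skewing factors.

```cpp
#include <algorithm>
#include <vector>

// Original nest: A[i][j] depends on A[i-1][j] and A[i][j-1] (dependence
// vectors (1,0) and (0,1)), so neither loop is parallel as written.
void sweep_orig(std::vector<std::vector<double>>& A, int N) {
    for (int i = 1; i < N; ++i)
        for (int j = 1; j < N; ++j)
            A[i][j] = A[i - 1][j] + A[i][j - 1];
}

// After the affine transformation (i, j) -> (t1, t2) = (i + j, j), the
// dependences become (1,0) and (1,1): both are carried by the outer t1 loop,
// so for a fixed t1 every t2 iteration is independent (a wavefront) and the
// inner loop could be marked parallel, e.g. with OpenMP.
void sweep_skewed(std::vector<std::vector<double>>& A, int N) {
    for (int t1 = 2; t1 <= 2 * N - 2; ++t1)
        for (int t2 = std::max(1, t1 - N + 1); t2 <= std::min(t1 - 1, N - 1); ++t2) {
            int i = t1 - t2, j = t2;
            A[i][j] = A[i - 1][j] + A[i][j - 1];
        }
}
```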
Scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly
Loïc Thébault, E. Petit, Quang Dinh
{"title":"Scalable and efficient implementation of 3d unstructured meshes computation: a case study on matrix assembly","authors":"Loïc Thébault, E. Petit, Quang Dinh","doi":"10.1145/2688500.2688517","DOIUrl":"https://doi.org/10.1145/2688500.2688517","url":null,"abstract":"Exposing massive parallelism on 3D unstructured meshes computation with efficient load balancing and minimal synchronizations is challenging. Current approaches relying on domain decomposition and mesh coloring struggle to scale with the increasing number of cores per nodes, especially with new many-core processors. In this paper, we propose an hybrid approach using domain decomposition to exploit distributed memory parallelism, Divide-and-Conquer, D&C, to exploit shared memory parallelism and improve locality, and mesh coloring at core level to exploit vectors. It illustrates a new trade-off for many-cores between structuredness, memory locality, and vectorization. We evaluate our approach on the finite element matrix assembly of an industrial fluid dynamic code developed by Dassault Aviation. We compare our D&C approach to domain decomposition and to mesh coloring. D&C achieves a high parallel efficiency, a good data locality as well as an improved bandwidth usage. It competes on current nodes with the optimized pure MPI version with a minimum 10% speed-up. D&C shows an impressive 319x strong scaling on 512 cores (32 nodes) with only 2000 vertices per core. Finally, the Intel Xeon Phi version has a performance similar to 10 Intel E5-2665 Xeon Sandy Bridge cores and 95% parallel efficiency on the 60 physical cores. Running on 4 Xeon Phi (240 cores), D&C has 92% efficiency on the physical cores and performance similar to 33 Intel E5-2665 Xeon Sandy Bridge cores.","PeriodicalId":291839,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124545225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
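A schematic of divide-and-conquer assembly, assuming leaf_size >= 1 and a partitioner that separates elements straddling the cut: the two halves are assembled in parallel and the separator elements afterwards, so concurrent tasks never update the same matrix rows. assemble and bisect are trivial placeholders here; the real code's recursive bisection, leaf-level coloring, and vectorization are not shown.

```cpp
#include <cstddef>
#include <future>
#include <vector>

struct Element { int id; /* connectivity, coefficients, ... */ };

// Placeholder leaf kernel: assembles a set of elements serially.
void assemble(const std::vector<Element>&) { /* add element contributions */ }

// Placeholder partitioner: a real one computes a geometric or graph bisection
// and puts the elements straddling the cut into `separator`.
void bisect(const std::vector<Element>& elems,
            std::vector<Element>& left, std::vector<Element>& right,
            std::vector<Element>& separator) {
    std::size_t mid = elems.size() / 2;
    left.assign(elems.begin(), elems.begin() + mid);
    right.assign(elems.begin() + mid, elems.end());
    separator.clear();  // none in this trivial split
}

void assemble_dc(const std::vector<Element>& elems, std::size_t leaf_size) {
    if (elems.size() <= leaf_size) { assemble(elems); return; }
    std::vector<Element> left, right, sep;
    bisect(elems, left, right, sep);
    auto half = std::async(std::launch::async,
                           [&] { assemble_dc(left, leaf_size); });
    assemble_dc(right, leaf_size);   // the other half runs concurrently
    half.get();
    assemble_dc(sep, leaf_size);     // separator last: touches rows of both halves
}
```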