{"title":"Efficient Instrumentation of GPGPU Applications Using Information Flow Analysis and Symbolic Execution","authors":"N. Farooqui, K. Schwan, S. Yalamanchili","doi":"10.1145/2588768.2576782","DOIUrl":"https://doi.org/10.1145/2588768.2576782","url":null,"abstract":"Dynamic instrumentation of GPGPU binaries makes possible real-time introspection methods for performance debugging, correctness checks, workload characterization, and runtime optimization. Such instrumentation involves inserting code at the instruction level of an application, while the application is running, thereby able to accurately profile data-dependent application behavior. Runtime overheads seen from instrumentation, however, can obviate its utility. This paper shows how a combination of information flow analysis and symbolic execution can be used to alleviate these overheads. The methods and their effectiveness are demonstrated for a variety of GPGPU codes written in OpenCL that run on AMD GPU target backends. Kernels that can be analyzed entirely via symbolic execution need not be instrumented, thus eliminating kernel runtime overheads altogether. For the remaining GPU kernels, our results show 5-38% improvements in kernel runtime overheads.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121776242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power Modeling for Heterogeneous Processors","authors":"Tahir Diop, Natalie D. Enright Jerger, J. Anderson","doi":"10.1145/2588768.2576790","DOIUrl":"https://doi.org/10.1145/2588768.2576790","url":null,"abstract":"As power becomes an ever more important design consideration, there is a need for accurate power models at all stages of the design process. While power models are available for CPUs and GPUs, only simple models are available for heterogeneous processors. We present a micro-benchmark-based modeling technique that can be used for chip multiprocessor (CMPs) and accelerated processing units (APUs). We use our approach to model power on an Intel Xeon CPU and an AMD Fusion heterogeneous processor. The resulting error rate for the Xeon's model is below 3% and is only 7% for the Fusion. We also present a method to reduce the number of benchmarks required to create these models. Instead of running micro-benchmarks for every combination of factors (e.g. different operations or memory access patterns), we cluster similar micro-benchmarks to avoid unnecessary simulations. We show that it is possible to eliminate as many as 93% of the compute micro-benchmarks, while still producing power models having less than 10% error rate.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127298387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors","authors":"Weifeng Liu, B. Vinter","doi":"10.1145/2588768.2576786","DOIUrl":"https://doi.org/10.1145/2588768.2576786","url":null,"abstract":"Heap is one of the most important fundamental data structures in computer science. Unfortunately, for a long time heaps did not obtain ideal performance gain from widely used throughput-oriented processors because of two reasons: (1) heap property decides that operations between any parent node and its child nodes must be executed sequentially, and (2) heaps, even d-heaps (d-ary heaps or d-way heaps), cannot supply enough wide data parallelism to these processors. Recent research proposed more versatile asymmetric multicore processors (AMPs) that consist of two types of cores (latency-oriented cores with high single-thread performance and throughput-oriented cores with wide vector processing capability), unified memory address space and faster synchronization mechanism among cores with different ISAs. To leverage the AMPs for the heap data structure, in this paper we propose ad-heap, an efficient heap data structure that introduces an implicit bridge structure and properly apportions workloads to the two types of cores. We implement a batch k-selection algorithm and conduct experiments on simulated AMP environments composed of real CPUs and GPUs. In our experiments on two representative platforms, the ad-heap obtains up to 1.5x and 3.6x speedup over the optimal AMP scheduling method that executes the fastest d-heaps on the standalone CPUs and GPUs in parallel.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"426 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132060592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Evaluation and Optimization Mechanisms for Inter-operable Graphics and Computation on GPUs","authors":"Yash Ukidave, Xiang Gong, D. Kaeli","doi":"10.1145/2588768.2576784","DOIUrl":"https://doi.org/10.1145/2588768.2576784","url":null,"abstract":"Graphics Processing Units (GPUs) have gained recognition as the primary form of accelerators for graphics rendering in the gaming domain. They have also been widely accepted as the computing platform of choice in many scientific and high performance computing domains. The parallelism offered by the GPUs is used for simultaneous processing of compute and graphics by applications belonging to a range of domains. The availability of programming standards such as OpenCL and OpenGL has been leveraged to achieve the compute-graphics interoperability in the same application. However, given the increasing demands in both compute and graphics for emerging scientific visualization and immersive gaming applications, degradation in efficiency can be seen due to the continual switching between compute/graphics, swapping in and out of their associated runtime environments. We need to better understand how to tune this interoperable environment in order to allow compute and graphics to run both efficiently and simultaneously. Presently we evaluate each of these domains in isolation. In this paper, we evaluate the performance and efficiency of the OpenCL-OpenGL(CL-GL) interoperability(interop) mode. We explore different methods to improve the execution performance of the CL-GL interop-based applications. We propose a slot-based rendering mechanism for CL-GL interop to increase the efficiency of the application. To evaluate CL-GL and our slot-based scheme, we study five scientific applications using OpenCL and OpenGL for compute and graphics rendering. Our study covers two AMD Radeon discrete GPUs and one shared memory AMD APU as test platforms. We demonstrate that leveraging the CL-GL interop interface results in a 2.2X performance increase, and our slot-based rendering provides 60% increase in performance by providing a 24% improvement in L2 cache hit rate on GPUs and APUs.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126103363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting GPU Hardware Saturation for Fast Compiler Optimization","authors":"A. Magni, Christophe Dubach, M. O’Boyle","doi":"10.1145/2588768.2576791","DOIUrl":"https://doi.org/10.1145/2588768.2576791","url":null,"abstract":"Graphics Processing Units (GPUs) are efficient devices capable of delivering high performance for general purpose computation. Realizing their full performance potential often requires extensive compiler tuning. This process is particularly expensive since it has to be repeated for each target program and platform. In this paper we study the utilization of GPU hardware resources across multiple input sizes and compiler options. In this context we introduce the notion of hardware saturation. Saturation is reached when an application is executed with a number of threads large enough to fully utilize the available hardware resources. We give experimental evidence of hardware saturation and describe its properties using 16 OpenCL kernels on 3 GPUs from Nvidia and AMD. We show that input sizes that saturates the GPU show performance stability across compiler transformations. Using the thread-coarsening transformation as an example, we show that compiler settings maintain their relative performance across input sizes within the saturation region. Leveraging these hardware and software properties we propose a technique to identify the input size at the lower bound of the saturation zone, we call it Minimum Saturation Point (MSP). By performing iterative compilation on the MSP input size we obtain results effectively applicable for much large input problems reducing the overhead of tuning by an order of magnitude on average.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130310101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KMA: A Dynamic Memory Manager for OpenCL","authors":"R. Spliet, Lee W. Howes, Benedict R. Gaster, A. Varbanescu","doi":"10.1145/2588768.2576781","DOIUrl":"https://doi.org/10.1145/2588768.2576781","url":null,"abstract":"OpenCL is becoming a popular choice for the parallel programming of both multi-core CPUs and GPGPUs. One of the features missing in OpenCL, yet commonly required in irregular parallel applications, is dynamic memory allocation. In this paper, we propose KMA, a first dynamic memory allocator for OpenCL. KMA's design is based on a thorough analysis of a set of 11 algorithms, which shows that dynamic memory allocation is a necessary commodity, typically used for implementing complex data structures (arrays, lists, trees) that need constant restructuring at runtime. Taking into account both the survey findings and the status-quo of OpenCL, we design KMA as a two-layer memory manager that makes smart use of the patterns we identified in our application analysis: its basic functionality provides generic malloc() and free() APIs, while the higher layer provides support for building and efficiently managing dynamic data structures. Our experiments measure the performance and usability of KMA, using both microbenchmarks and a real-life case-study. Results show that when dynamic allocation is mandatory, KMA is a competitive allocator. We conclude that embedding dynamic memory allocation in OpenCL is feasible, but it is a complex, delicate task due to the massive parallelism of the platform and the portability issues between different OpenCL implementations.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125702972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A CPU: GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method","authors":"JeeWhan Choi, Aparna Chandramowlishwaran, Kamesh Madduri, R. Vuduc","doi":"10.1145/2588768.2576787","DOIUrl":"https://doi.org/10.1145/2588768.2576787","url":null,"abstract":"This paper presents an optimized CPU--GPU hybrid implementation and a GPU performance model for the kernel-independent fast multipole method (FMM). We implement an optimized kernel-independent FMM for GPUs, and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. When compared to another highly optimized GPU implementation, our implementation achieves as much as a 1.9× speedup. We then extend our previous lower bound analyses of FMM for CPUs to include GPUs. This yields a model for predicting the execution times of the different phases of FMM. Using this information, we estimate the execution times of a set of static hybrid schedules on a given system, which allows us to automatically choose the schedule that yields the best performance. In the best case, we achieve a speedup of 1.5× compared to our GPU-only implementation, despite the large difference in computational powers of CPUs and GPUs. We comment on one consequence of having such performance models, which is to enable speculative predictions about FMM scalability on future systems.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132593802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ParallelJS: An Execution Framework for JavaScript on Heterogeneous Systems","authors":"Jin Wang, Norman Rubin, S. Yalamanchili","doi":"10.1145/2588768.2576788","DOIUrl":"https://doi.org/10.1145/2588768.2576788","url":null,"abstract":"JavaScript has been recognized as one of the most widely used script languages. Optimizations of JavaScript engines on mainstream web browsers enable efficient execution of JavaScript programs on CPUs. However, running JavaScript applications on emerging heterogeneous architectures that feature massively parallel hardware such as GPUs has not been well studied. This paper proposes a framework for flexible mapping of JavaScript onto heterogeneous systems that have both CPUs and GPUs. The framework includes a frontend compiler, a construct library and a runtime system. JavaScript programs written with high-level constructs are compiled to GPU binary code and scheduled to GPUs by the runtime. Experiments show that the proposed framework achieves up to 26.8x speedup executing JavaScript applications on parallel GPUs over a mainstream web browser that runs on CPUs.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128939615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications","authors":"Adwait Jog, Evgeny Bolotin, Zvika Guz, Mike Parker, S. Keckler, M. Kandemir, C. Das","doi":"10.1145/2588768.2576780","DOIUrl":"https://doi.org/10.1145/2588768.2576780","url":null,"abstract":"The available computing resources in modern GPUs are growing with each new generation. However, as many general purpose applications with limited thread-scalability are tuned to take advantage of GPUs, available compute resources might not be optimally utilized. To address this, modern GPUs will need to execute multiple kernels simultaneously. As current generations of GPUs (e.g., NVIDIA Kepler, AMD Radeon) already enable concurrent execution of kernels from the same application, in this paper we address the next logical step: executing multiple concurrent applications in GPUs. We show that while this paradigm has a potential to improve the overall system performance, negative interactions among concurrently executing applications in the memory system can severely hamper the performance and fairness among applications. We show that the current application agnostic GPU memory system design can (1) lead to sub-optimal GPU performance; and (2) create significant imbalance in performance slowdowns across kernels. Thus, we argue that GPU memory system should be augmented with application awareness. As one example to the applicability of this concept, we augment the memory system hardware with application awareness such that requests from different applications can be scheduled in a round robin (RR) fashion while still preserving the benefits of the current first-ready FCFS (FR-FCFS) memory scheduling policy. Evaluations with different multi-application workloads demonstrate that the proposed memory scheduling policy, first-ready round-robin FCFS (FR-RR-FCFS), improves fairness and delivers better system performance compared to the existing FR-FCFS memory scheduling scheme.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133208434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring GPU Power with the K20 Built-in Sensor","authors":"Martin Burtscher, I. Zecena, Ziliang Zong","doi":"10.1145/2588768.2576783","DOIUrl":"https://doi.org/10.1145/2588768.2576783","url":null,"abstract":"GPU-accelerated programs are becoming increasingly common in HPC, personal computers, and even handheld devices, making it important to optimize their energy efficiency. However, accurately profiling the power consumption of GPU code is not straightforward. In fact, we have identified multiple anomalies when using the on-board power sensor of K20 GPUs. For example, we have found that doubling a kernel's runtime more than doubles its energy usage, that kernels consume energy after they have stopped executing, and that running two kernels in close temporal proximity inflates the energy consumption of the later kernel. Moreover, we have observed that the power sampling frequency varies greatly and that the GPU sensor only performs power readings once in a while. We present a methodology to accurately compute the instant power and the energy consumption despite these issues.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125851647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}