GPGPU-5 Latest Publications

Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159431
C. Nugteren, H. Corporaal
{"title":"Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons","authors":"C. Nugteren, H. Corporaal","doi":"10.1145/2159430.2159431","DOIUrl":"https://doi.org/10.1145/2159430.2159431","url":null,"abstract":"Recent advances in multi-core and many-core processors requires programmers to exploit an increasing amount of parallelism from their applications. Data parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers.\u0000 A number of parallelizing source-to-source compilers have recently been developed to ease programming of multi-core and many-core processors. This work presents and evaluates a number of such tools, focused in particular on C-to-CUDA transformations targeting GPUs. We compare these tools both qualitatively and quantitatively to each other and identify their strengths and weaknesses.\u0000 In this paper, we address the weaknesses by presenting a new classification of algorithms. This classification is used in a new source-to-source compiler, which is based on the algorithmic skeletons technique. The compiler generates target code based on skeletons of parallel structures, which can be seen as parameterisable library implementations for a set of algorithm classes. We furthermore demonstrate that the presented compiler requires little modifications to the original sequential source code, generates readable code for further fine-tuning, and delivers superior performance compared to other tools for a set of 8 image processing kernels.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125841002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 49
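A minimal CUDA sketch of the algorithmic-skeleton idea described in the abstract: a parameterisable element-wise (map) skeleton into which a user-supplied operation is plugged, so only the operation itself has to be written per algorithm class. The names map_skeleton and Threshold are illustrative assumptions, not Bones' actual skeleton library or generated code.

#include <cuda_runtime.h>

// User-supplied element operation (here: a simple pixel threshold).
// In a skeleton-based compiler, only this part would come from the user code.
struct Threshold {
    __device__ unsigned char operator()(unsigned char in) const {
        return in > 128 ? 255 : 0;
    }
};

// Generic element-wise (map) skeleton: one thread per element.
template <typename Op, typename T>
__global__ void map_skeleton(const T* in, T* out, int n, Op op) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = op(in[i]);
}

int main() {
    const int n = 1 << 20;
    unsigned char *d_in, *d_out;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemset(d_in, 100, n);

    int block = 256, grid = (n + block - 1) / block;
    map_skeleton<<<grid, block>>>(d_in, d_out, n, Threshold());
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The same skeleton could be instantiated with other element operations; skeletons for reduction or neighbourhood (stencil) classes would follow the same pattern with a different parallel structure.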
High-performance sparse matrix-vector multiplication on GPUs for structured grid computations
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159436
J. Godwin, Justin Holewinski, P. Sadayappan
{"title":"High-performance sparse matrix-vector multiplication on GPUs for structured grid computations","authors":"J. Godwin, Justin Holewinski, P. Sadayappan","doi":"10.1145/2159430.2159436","DOIUrl":"https://doi.org/10.1145/2159430.2159436","url":null,"abstract":"In this paper, we address efficient sparse matrix-vector multiplication for matrices arising from structured grid problems with high degrees of freedom at each grid node. Sparse matrix-vector multiplication is a critical step in the iterative solution of sparse linear systems of equations arising in the solution of partial differential equations using uniform grids for discretization. With uniform grids, the resulting linear system Ax = b has a matrix A that is sparse with a very regular structure. The specific focus of this paper is on sparse matrices that have a block structure due to the large number of unknowns at each grid point. Sparse matrix storage formats such as Compressed Sparse Row (CSR) and Diagonal format (DIA) are not the most effective for such matrices.\u0000 In this work, we present a new sparse matrix storage format that takes advantage of the diagonal structure of matrices for stencil operations on structured grids. Unlike other formats such as the Diagonal storage format (DIA), we specifically optimize for the case of higher degrees of freedom, where formats such as DIA are forced to explicitly represent many zero elements in the sparse matrix. We develop efficient sparse matrix-vector multiplication for structured grid computations on GPU architectures using CUDA [25].","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127912703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 36
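For reference, a minimal CUDA SpMV kernel for the plain DIA format that the abstract names as a baseline: each thread handles one row and walks the stored diagonals. The diagonal-major, zero-padded layout assumed here is a common DIA convention, not the paper's new block-aware format.

#include <cstdio>
#include <cuda_runtime.h>

// y = A*x for a matrix in DIA format: data holds num_diags diagonals, each
// padded to num_rows entries; offsets[d] is the column offset of diagonal d.
__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int* offsets, const float* data,
                         const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < num_cols)
            sum += data[d * num_rows + row] * x[col];  // padded entries are zero
    }
    y[row] = sum;
}

int main() {
    // 4x4 tridiagonal example: diagonals at offsets -1, 0, +1.
    const int n = 4, nd = 3;
    int offsets[nd] = {-1, 0, 1};
    float data[nd * n] = {
        0, 1, 1, 1,   // sub-diagonal (first entry is padding)
        2, 2, 2, 2,   // main diagonal
        1, 1, 1, 0    // super-diagonal (last entry is padding)
    };
    float x[n] = {1, 1, 1, 1}, y[n];

    int *d_off; float *d_data, *d_x, *d_y;
    cudaMalloc(&d_off, sizeof(offsets));
    cudaMalloc(&d_data, sizeof(data));
    cudaMalloc(&d_x, sizeof(x));
    cudaMalloc(&d_y, sizeof(y));
    cudaMemcpy(d_off, offsets, sizeof(offsets), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, data, sizeof(data), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x, sizeof(x), cudaMemcpyHostToDevice);

    spmv_dia<<<1, 32>>>(n, n, nd, d_off, d_data, d_x, d_y);
    cudaMemcpy(y, d_y, sizeof(y), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("y[%d] = %.1f\n", i, y[i]);

    cudaFree(d_off); cudaFree(d_data); cudaFree(d_x); cudaFree(d_y);
    return 0;
}

With many degrees of freedom per grid point, each logical diagonal widens into a band of mostly zero diagonals, which is the explicit-zero overhead the paper's block-aware format is designed to avoid.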
A distributed data-parallel framework for analysis and visualization algorithm development
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159432
J. Meredith, R. Sisneros, D. Pugmire, Sean Ahern
{"title":"A distributed data-parallel framework for analysis and visualization algorithm development","authors":"J. Meredith, R. Sisneros, D. Pugmire, Sean Ahern","doi":"10.1145/2159430.2159432","DOIUrl":"https://doi.org/10.1145/2159430.2159432","url":null,"abstract":"The coming generation of supercomputing architectures will require fundamental changes in programming models to effectively make use of the expected million to billion way concurrency and thousand-fold reduction in per-core memory. Most current parallel analysis and visualization tools achieve scalability by partitioning the data, either spatially or temporally, and running serial computational kernels on each data partition, using message passing as needed. These techniques lack the necessary level of data parallelism to execute effectively on the underlying hardware. This paper introduces a framework that enables the expression of analysis and visualization algorithms with memory-efficient execution in a hybrid distributed and data parallel manner on both multi-core and many-core processors. We demonstrate results on scientific data using CPUs and GPUs in scalable heterogeneous systems.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114056161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
Full system simulation of many-core heterogeneous SoCs using GPU and QEMU semihosting
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159442
Shivani Raghav, A. Marongiu, Christian Pinto, David Atienza Alonso, M. Ruggiero, L. Benini
{"title":"Full system simulation of many-core heterogeneous SoCs using GPU and QEMU semihosting","authors":"Shivani Raghav, A. Marongiu, Christian Pinto, David Atienza Alonso, M. Ruggiero, L. Benini","doi":"10.1145/2159430.2159442","DOIUrl":"https://doi.org/10.1145/2159430.2159442","url":null,"abstract":"Modern system-on-chips are evolving towards complex and heterogeneous platforms with general purpose processors coupled with massively parallel manycore accelerator fabrics (e.g. embedded GPUs). Platform developers are looking for efficient full-system simulators capable of simulating complex applications, middleware and operating systems on these heterogeneous targets. Unfortunately current virtual platforms are not able to tackle the complexity and heterogeneity of state-of-the-art SoCs. Software emulators, such as the open-source QEMU project, cope quite well in terms of simulation speed and functional accuracy with homogeneous coarse-grained multi-cores. The main contribution of this paper is the introduction of a novel virtual prototyping technique which exploits the heterogeneous accelerators available in commodity PCs to tackle the heterogeneity challenge in full-SoC system simulation. In a nutshell, our approach makes it possible to partition simulation between the host CPU and GPU. More specifically, QEMU runs on the host CPU and the simulation of manycore accelerators is offloaded, through semi-hosting, to the host GPU. Our experimental results confirm the flexibility and efficiency of our enhanced QEMU environment.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115663403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
A GPU-based high-throughput image retrieval algorithm
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159434
Feiwen Zhu, Peng Chen, Donglei Yang, Weihua Zhang, Haibo Chen, B. Zang
{"title":"A GPU-based high-throughput image retrieval algorithm","authors":"Feiwen Zhu, Peng Chen, Donglei Yang, Weihua Zhang, Haibo Chen, B. Zang","doi":"10.1145/2159430.2159434","DOIUrl":"https://doi.org/10.1145/2159430.2159434","url":null,"abstract":"With the development of Internet and cloud computing, multimedia data, such as images and videos, has become one of the most common data types being processed. As the scale of multimedia data being still increasing, it is vitally important to efficiently extract useful information from such a huge amount of multimedia data. However, due to the complexity of the core algorithms, multimedia retrieval applications are not only data intensive but also computationally intensive. Therefore, it has been a major challenge to accelerate the processing speed of such applications to satisfy the real-time requirement.\u0000 As Graphic Processing Unit (GPU) has entered the general-propose computing domain (GPGPU), it has become one of the most popular accelerators for the applications with real-time requirements. In this paper, we parallelize a widely-used image retrieval algorithm called SURF on GPGPU, which is the core algorithm for many video and image retrieval applications. We first analyze the parallelism within SURF to guarantee that there are sufficient tasks being mapped to the large-scale computation resources in GPGPU. We then exploit some inherent GPGPU characteristics, such as 2D memory, to further boost the performance. Finally, we provide some optimization to the cooperation between CPU and GPGPU, which is generally ignored in previous designs. Experimental results show that our parallelization and optimization achieve a throughput of 340.5 frames/s on a NVIDIA GTX295 GPGPU, which is 15X faster than the maximal optimized CPU version. Compared to CUDA SURF, a state-of-the-art parallelization of SURF on GPGPU, our system achieves a speedup by a factor of 2.3X.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123683229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
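The "2D memory" mentioned in the abstract can be illustrated with pitched device allocations, which pad each image row to an aligned width so row-wise accesses stay coalesced. The row-sum kernel below is only a stand-in workload, not part of the SURF pipeline.

#include <cuda_runtime.h>

// Sum each image row; the pitch (in bytes) returned by cudaMallocPitch keeps
// every row start aligned regardless of the image width.
__global__ void row_sum(const unsigned char* img, size_t pitch,
                        int width, int height, float* out) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height) return;
    const unsigned char* row = img + y * pitch;
    float s = 0.0f;
    for (int x = 0; x < width; ++x) s += row[x];
    out[y] = s;
}

int main() {
    const int width = 1920, height = 1080;
    unsigned char* d_img;
    size_t pitch;
    cudaMallocPitch((void**)&d_img, &pitch, width, height);  // aligned 2D allocation
    cudaMemset2D(d_img, pitch, 1, width, height);

    float* d_out;
    cudaMalloc(&d_out, height * sizeof(float));

    int block = 128, grid = (height + block - 1) / block;
    row_sum<<<grid, block>>>(d_img, pitch, width, height, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    cudaFree(d_out);
    return 0;
}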
Dynamic particle system for mesh extraction on the GPU
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159435
Mark Kim, Guoning Chen, C. Hansen
{"title":"Dynamic particle system for mesh extraction on the GPU","authors":"Mark Kim, Guoning Chen, C. Hansen","doi":"10.1145/2159430.2159435","DOIUrl":"https://doi.org/10.1145/2159430.2159435","url":null,"abstract":"Extracting isosurfaces represented as high quality meshes from three-dimensional scalar fields is needed for many important applications, particularly visualization and numerical simulations. One recent advance for extracting high quality meshes for isosurface computation is based on a dynamic particle system. Unfortunately, this state-of-the-art particle placement technique requires a significant amount of time to produce a satisfactory mesh. To address this issue, we study the parallelism property of the particle placement and make use of CUDA, a parallel programming technique on the GPU, to significantly improve the performance of particle placement. This paper describes the curvature dependent sampling method used to extract high quality meshes and describes its implementation using CUDA on the GPU.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"85 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122603353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
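A minimal sketch of the per-particle parallelism being exploited: one CUDA thread per particle, each taking an independent projection step toward the isosurface F(p) = iso along the field gradient. The analytic sphere field and the single Newton-style step are assumptions for illustration; the paper's curvature-dependent sampling and inter-particle energies are not reproduced here.

#include <vector>
#include <cuda_runtime.h>

struct Particle { float x, y, z; };

// Analytic scalar field used only for this sketch (a sphere of radius sqrt(iso)).
__device__ float field(float x, float y, float z) { return x * x + y * y + z * z; }

__global__ void project_to_isosurface(Particle* p, int n, float iso) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = p[i].x, y = p[i].y, z = p[i].z;
    float gx = 2.0f * x, gy = 2.0f * y, gz = 2.0f * z;      // gradient of the sphere field
    float g2 = gx * gx + gy * gy + gz * gz + 1e-12f;
    float step = (field(x, y, z) - iso) / g2;               // Newton-like step toward F(p) = iso
    p[i].x = x - step * gx;
    p[i].y = y - step * gy;
    p[i].z = z - step * gz;
}

int main() {
    const int n = 1024;
    std::vector<Particle> h(n);
    for (int i = 0; i < n; ++i) h[i] = {1.0f + i * 0.001f, 0.5f, 0.25f};

    Particle* d_p;
    cudaMalloc(&d_p, n * sizeof(Particle));
    cudaMemcpy(d_p, h.data(), n * sizeof(Particle), cudaMemcpyHostToDevice);

    // A few relaxation iterations; each is one bulk-parallel kernel launch.
    for (int it = 0; it < 10; ++it)
        project_to_isosurface<<<(n + 255) / 256, 256>>>(d_p, n, 1.0f);
    cudaDeviceSynchronize();

    cudaFree(d_p);
    return 0;
}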
Enabling task-level scheduling on heterogeneous platforms
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159440
Enqiang Sun, Dana Schaa, Richard Bagley, Norman Rubin, D. Kaeli
{"title":"Enabling task-level scheduling on heterogeneous platforms","authors":"Enqiang Sun, Dana Schaa, Richard Bagley, Norman Rubin, D. Kaeli","doi":"10.1145/2159430.2159440","DOIUrl":"https://doi.org/10.1145/2159430.2159440","url":null,"abstract":"OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. While OpenCL has been widely adopted, there still remains a lack of support for automatic task scheduling and data consistency when multiple devices appear in the system. To address this need, we have designed a task queueing extension for OpenCL that provides a high-level, unified execution model tightly coupled with a resource management facility. The main motivation for developing this extension is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible scheduling schemes. To demonstrate the value and utility of this extension, we have utilized an advanced OpenCL-based imaging toolkit called clSURF. Using our task queueing extension, we demonstrate the potential performance opportunities and limitations given current vendor implementations of OpenCL. Using a state-of-art implementation on a single GPU device as the baseline, our task queueing extension achieves a speedup up to 72.4%. Our extension also achieves scalable performance gains on multiple heterogeneous GPU devices. The performance trade-offs of using the host CPU as an accelerator are also evaluated.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115545101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 39
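The paper's task queueing extension is OpenCL-specific, but the scheduling pattern it automates can be sketched in CUDA: a host-side loop that hands independent tasks to one stream per device. The round-robin assignment and the stand-in kernel below are assumptions for illustration, not the clSURF-based extension itself.

#include <vector>
#include <cuda_runtime.h>

__global__ void task_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;   // stand-in workload
}

int main() {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices == 0) return 0;

    const int num_tasks = 16, n = 1 << 20;
    std::vector<cudaStream_t> streams(num_devices);
    std::vector<float*> buffers(num_tasks);

    for (int d = 0; d < num_devices; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&streams[d]);
    }

    // Round-robin dispatch: task t runs on device t % num_devices.
    for (int t = 0; t < num_tasks; ++t) {
        int d = t % num_devices;
        cudaSetDevice(d);
        cudaMalloc(&buffers[t], n * sizeof(float));
        cudaMemsetAsync(buffers[t], 0, n * sizeof(float), streams[d]);
        task_kernel<<<(n + 255) / 256, 256, 0, streams[d]>>>(buffers[t], n);
    }

    for (int d = 0; d < num_devices; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
    }
    for (int t = 0; t < num_tasks; ++t) {
        cudaSetDevice(t % num_devices);
        cudaFree(buffers[t]);
    }
    return 0;
}

A real scheduler would pick the first idle device rather than a fixed round-robin order and would track where each buffer lives, which is the data-consistency problem the extension is designed to manage automatically.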
JaBEE: framework for object-oriented Java bytecode compilation and execution on graphics processor units
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159439
Wojciech Zaremba, Yuan Lin, Vinod Grover
{"title":"JaBEE: framework for object-oriented Java bytecode compilation and execution on graphics processor units","authors":"Wojciech Zaremba, Yuan Lin, Vinod Grover","doi":"10.1145/2159430.2159439","DOIUrl":"https://doi.org/10.1145/2159430.2159439","url":null,"abstract":"There is an increasing interest from software developers in executing Java and .NET bytecode programs on General Purpose Graphics Processor Units (GPGPUs). Existing solutions have limited support for operations on objects and often require explicit handling of memory transfers between CPU and GPU. In this paper, we describe a Java Bytecode Execution Environment (JaBEE) which supports common object-oriented constructs such as dynamic dispatch, encapsulation and object creation on GPUs. This experimental environment facilitates GPU code compilation, execution and transparent memory management. We compare the performance of our approach with CPU-based and CUDA-C-based code executions of the same programs. We discuss challenges, limitations and opportunities of bytecode execution on GPGPUs.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131905318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 36
FLAT: a GPU programming framework to provide embedded MPI
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159433
T. Miyoshi, H. Irie, Keigo Shima, H. Honda, Masaaki Kondo, T. Yoshinaga
{"title":"FLAT: a GPU programming framework to provide embedded MPI","authors":"T. Miyoshi, H. Irie, Keigo Shima, H. Honda, Masaaki Kondo, T. Yoshinaga","doi":"10.1145/2159430.2159433","DOIUrl":"https://doi.org/10.1145/2159430.2159433","url":null,"abstract":"For leveraging multiple GPUs in a cluster system, it is necessary to assign application tasks to multiple GPUs and execute those tasks with appropriately using communication primitives to handle data transfer among GPUs. In current GPU programming models, communication primitives such as MPI functions cannot be used within GPU kernels. Instead, such functions should be used in the CPU code. Therefore, programmer must handle both GPU kernel and CPU code for data communications. This makes GPU programming and its optimization very difficult.\u0000 In this paper, we propose a programming framework named FLAT which enables programmers to use MPI functions within GPU kernels. Our framework automatically transforms MPI functions written in a GPU kernel into runtime routines executed on the CPU. The execution model and the implementation of FLAT are described, and the applicability of FLAT in terms of scalability and programmability is discussed. We also evaluate the performance of FLAT. The result shows that FLAT achieves good scalability for intended applications.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114144642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
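What FLAT automates can be sketched by hand: an MPI call that conceptually sits inside the kernel is realized by splitting the kernel at the communication point and letting the host stage the transfer. The two phase kernels and the one-element ring exchange below are illustrative assumptions, not FLAT's generated code.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void compute_phase1(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;   // stand-in for work before the embedded MPI call
}

__global__ void compute_phase2(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;   // stand-in for work after the embedded MPI call
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    compute_phase1<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();

    // The communication the kernel "requested": staged through host memory
    // and exchanged with the neighbouring rank in a ring.
    float send, recv;
    cudaMemcpy(&send, d_buf, sizeof(float), cudaMemcpyDeviceToHost);
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    MPI_Sendrecv(&send, 1, MPI_FLOAT, next, 0,
                 &recv, 1, MPI_FLOAT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, &recv, sizeof(float), cudaMemcpyHostToDevice);

    compute_phase2<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Writing this split by hand for every communication point is exactly the burden the abstract describes; FLAT performs the transformation from in-kernel MPI calls to such host-side runtime routines automatically.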
Reducing off-chip memory traffic by selective cache management scheme in GPGPUs
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159443
Hyojin Choi, Jae-Woo Ahn, Wonyong Sung
{"title":"Reducing off-chip memory traffic by selective cache management scheme in GPGPUs","authors":"Hyojin Choi, Jae-Woo Ahn, Wonyong Sung","doi":"10.1145/2159430.2159443","DOIUrl":"https://doi.org/10.1145/2159430.2159443","url":null,"abstract":"The performance of General Purpose Graphics Processing Units (GPGPUs) is frequently limited by the off-chip memory bandwidth. To mitigate this bandwidth wall problem, recent GPUs are equipped with on-chip L1 and L2 caches. However, there has been little work for better utilizing on-chip shared caches in GPGPUs. In this paper, we propose two cache management schemes: write-buffering and read-bypassing. The write buffering technique tries to utilize the shared cache for inter-block communication, and thereby reduces the DRAM accesses as much as the capacity of the cache. The read-bypassing scheme prevents the shared cache from being polluted by streamed data that are consumed only within a thread-block. The proposed schemes can be selectively applied to global memory instructions using newly defined cache operators. We evaluate the effects of the proposed schemes for a few GPGPU applications by simulations. We have shown that the off-chip memory accesses can be successfully reduced by the proposed techniques. We also analyze the effectiveness of these methods when the throughput gap between cores and off-chip memory becomes wider.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129625629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
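The scheme attaches cache behaviour to individual global memory instructions through cache operators; existing PTX already exposes a related mechanism, shown below as an inline-assembly streaming load (ld.global.cs) that marks data as evict-first. This is only an analogy on real hardware: the paper's write-buffering and read-bypassing operators are newly defined and evaluated in simulation, and are not reproduced here.

#include <cuda_runtime.h>

// Cache-streaming load: hints that the value will not be reused, so it should
// not displace other cached data.
__device__ float load_streaming(const float* addr) {
    float v;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

__global__ void scale(const float* in, float* out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * load_streaming(in + i);   // streamed data, read once
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}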