GPGPU-5 Latest Publications

Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159431
C. Nugteren, H. Corporaal
{"title":"Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons","authors":"C. Nugteren, H. Corporaal","doi":"10.1145/2159430.2159431","DOIUrl":"https://doi.org/10.1145/2159430.2159431","url":null,"abstract":"Recent advances in multi-core and many-core processors requires programmers to exploit an increasing amount of parallelism from their applications. Data parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers.\u0000 A number of parallelizing source-to-source compilers have recently been developed to ease programming of multi-core and many-core processors. This work presents and evaluates a number of such tools, focused in particular on C-to-CUDA transformations targeting GPUs. We compare these tools both qualitatively and quantitatively to each other and identify their strengths and weaknesses.\u0000 In this paper, we address the weaknesses by presenting a new classification of algorithms. This classification is used in a new source-to-source compiler, which is based on the algorithmic skeletons technique. The compiler generates target code based on skeletons of parallel structures, which can be seen as parameterisable library implementations for a set of algorithm classes. We furthermore demonstrate that the presented compiler requires little modifications to the original sequential source code, generates readable code for further fine-tuning, and delivers superior performance compared to other tools for a set of 8 image processing kernels.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125841002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 49
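A minimal CUDA sketch of the algorithmic-skeleton idea described in the abstract: a parameterisable element-wise (map) skeleton into which a user-supplied operation is plugged, so only the operation itself has to be written per algorithm class. The names map_skeleton and Threshold are illustrative assumptions, not Bones' actual skeleton library or generated code.

#include <cuda_runtime.h>

// User-supplied element operation (here: a simple pixel threshold).
// In a skeleton-based compiler, only this part would come from the user code.
struct Threshold {
    __device__ unsigned char operator()(unsigned char in) const {
        return in > 128 ? 255 : 0;
    }
};

// Generic element-wise (map) skeleton: one thread per element.
template <typename Op, typename T>
__global__ void map_skeleton(const T* in, T* out, int n, Op op) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = op(in[i]);
}

int main() {
    const int n = 1 << 20;
    unsigned char *d_in, *d_out;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemset(d_in, 100, n);

    int block = 256, grid = (n + block - 1) / block;
    map_skeleton<<<grid, block>>>(d_in, d_out, n, Threshold());
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The same skeleton could be instantiated with other element operations; skeletons for reduction or neighbourhood (stencil) classes would follow the same pattern with a different parallel structure.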
High-performance sparse matrix-vector multiplication on GPUs for structured grid computations
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159436
J. Godwin, Justin Holewinski, P. Sadayappan
{"title":"High-performance sparse matrix-vector multiplication on GPUs for structured grid computations","authors":"J. Godwin, Justin Holewinski, P. Sadayappan","doi":"10.1145/2159430.2159436","DOIUrl":"https://doi.org/10.1145/2159430.2159436","url":null,"abstract":"In this paper, we address efficient sparse matrix-vector multiplication for matrices arising from structured grid problems with high degrees of freedom at each grid node. Sparse matrix-vector multiplication is a critical step in the iterative solution of sparse linear systems of equations arising in the solution of partial differential equations using uniform grids for discretization. With uniform grids, the resulting linear system Ax = b has a matrix A that is sparse with a very regular structure. The specific focus of this paper is on sparse matrices that have a block structure due to the large number of unknowns at each grid point. Sparse matrix storage formats such as Compressed Sparse Row (CSR) and Diagonal format (DIA) are not the most effective for such matrices.\u0000 In this work, we present a new sparse matrix storage format that takes advantage of the diagonal structure of matrices for stencil operations on structured grids. Unlike other formats such as the Diagonal storage format (DIA), we specifically optimize for the case of higher degrees of freedom, where formats such as DIA are forced to explicitly represent many zero elements in the sparse matrix. We develop efficient sparse matrix-vector multiplication for structured grid computations on GPU architectures using CUDA [25].","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127912703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 36
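For reference, a minimal CUDA SpMV kernel for the plain DIA format that the abstract names as a baseline: each thread handles one row and walks the stored diagonals. The diagonal-major, zero-padded layout assumed here is a common DIA convention, not the paper's new block-aware format.

#include <cstdio>
#include <cuda_runtime.h>

// y = A*x for a matrix in DIA format: data holds num_diags diagonals, each
// padded to num_rows entries; offsets[d] is the column offset of diagonal d.
__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int* offsets, const float* data,
                         const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < num_cols)
            sum += data[d * num_rows + row] * x[col];  // padded entries are zero
    }
    y[row] = sum;
}

int main() {
    // 4x4 tridiagonal example: diagonals at offsets -1, 0, +1.
    const int n = 4, nd = 3;
    int offsets[nd] = {-1, 0, 1};
    float data[nd * n] = {
        0, 1, 1, 1,   // sub-diagonal (first entry is padding)
        2, 2, 2, 2,   // main diagonal
        1, 1, 1, 0    // super-diagonal (last entry is padding)
    };
    float x[n] = {1, 1, 1, 1}, y[n];

    int *d_off; float *d_data, *d_x, *d_y;
    cudaMalloc(&d_off, sizeof(offsets));
    cudaMalloc(&d_data, sizeof(data));
    cudaMalloc(&d_x, sizeof(x));
    cudaMalloc(&d_y, sizeof(y));
    cudaMemcpy(d_off, offsets, sizeof(offsets), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, data, sizeof(data), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x, sizeof(x), cudaMemcpyHostToDevice);

    spmv_dia<<<1, 32>>>(n, n, nd, d_off, d_data, d_x, d_y);
    cudaMemcpy(y, d_y, sizeof(y), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("y[%d] = %.1f\n", i, y[i]);

    cudaFree(d_off); cudaFree(d_data); cudaFree(d_x); cudaFree(d_y);
    return 0;
}

With many degrees of freedom per grid point, each logical diagonal widens into a band of mostly zero diagonals, which is the explicit-zero overhead the paper's block-aware format is designed to avoid.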
A distributed data-parallel framework for analysis and visualization algorithm development
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159432
J. Meredith, R. Sisneros, D. Pugmire, Sean Ahern
{"title":"A distributed data-parallel framework for analysis and visualization algorithm development","authors":"J. Meredith, R. Sisneros, D. Pugmire, Sean Ahern","doi":"10.1145/2159430.2159432","DOIUrl":"https://doi.org/10.1145/2159430.2159432","url":null,"abstract":"The coming generation of supercomputing architectures will require fundamental changes in programming models to effectively make use of the expected million to billion way concurrency and thousand-fold reduction in per-core memory. Most current parallel analysis and visualization tools achieve scalability by partitioning the data, either spatially or temporally, and running serial computational kernels on each data partition, using message passing as needed. These techniques lack the necessary level of data parallelism to execute effectively on the underlying hardware. This paper introduces a framework that enables the expression of analysis and visualization algorithms with memory-efficient execution in a hybrid distributed and data parallel manner on both multi-core and many-core processors. We demonstrate results on scientific data using CPUs and GPUs in scalable heterogeneous systems.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114056161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
Full system simulation of many-core heterogeneous SoCs using GPU and QEMU semihosting
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159442
Shivani Raghav, A. Marongiu, Christian Pinto, David Atienza Alonso, M. Ruggiero, L. Benini
{"title":"Full system simulation of many-core heterogeneous SoCs using GPU and QEMU semihosting","authors":"Shivani Raghav, A. Marongiu, Christian Pinto, David Atienza Alonso, M. Ruggiero, L. Benini","doi":"10.1145/2159430.2159442","DOIUrl":"https://doi.org/10.1145/2159430.2159442","url":null,"abstract":"Modern system-on-chips are evolving towards complex and heterogeneous platforms with general purpose processors coupled with massively parallel manycore accelerator fabrics (e.g. embedded GPUs). Platform developers are looking for efficient full-system simulators capable of simulating complex applications, middleware and operating systems on these heterogeneous targets. Unfortunately current virtual platforms are not able to tackle the complexity and heterogeneity of state-of-the-art SoCs. Software emulators, such as the open-source QEMU project, cope quite well in terms of simulation speed and functional accuracy with homogeneous coarse-grained multi-cores. The main contribution of this paper is the introduction of a novel virtual prototyping technique which exploits the heterogeneous accelerators available in commodity PCs to tackle the heterogeneity challenge in full-SoC system simulation. In a nutshell, our approach makes it possible to partition simulation between the host CPU and GPU. More specifically, QEMU runs on the host CPU and the simulation of manycore accelerators is offloaded, through semi-hosting, to the host GPU. Our experimental results confirm the flexibility and efficiency of our enhanced QEMU environment.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115663403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
A GPU-based high-throughput image retrieval algorithm
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159434
Feiwen Zhu, Peng Chen, Donglei Yang, Weihua Zhang, Haibo Chen, B. Zang
{"title":"A GPU-based high-throughput image retrieval algorithm","authors":"Feiwen Zhu, Peng Chen, Donglei Yang, Weihua Zhang, Haibo Chen, B. Zang","doi":"10.1145/2159430.2159434","DOIUrl":"https://doi.org/10.1145/2159430.2159434","url":null,"abstract":"With the development of Internet and cloud computing, multimedia data, such as images and videos, has become one of the most common data types being processed. As the scale of multimedia data being still increasing, it is vitally important to efficiently extract useful information from such a huge amount of multimedia data. However, due to the complexity of the core algorithms, multimedia retrieval applications are not only data intensive but also computationally intensive. Therefore, it has been a major challenge to accelerate the processing speed of such applications to satisfy the real-time requirement.\u0000 As Graphic Processing Unit (GPU) has entered the general-propose computing domain (GPGPU), it has become one of the most popular accelerators for the applications with real-time requirements. In this paper, we parallelize a widely-used image retrieval algorithm called SURF on GPGPU, which is the core algorithm for many video and image retrieval applications. We first analyze the parallelism within SURF to guarantee that there are sufficient tasks being mapped to the large-scale computation resources in GPGPU. We then exploit some inherent GPGPU characteristics, such as 2D memory, to further boost the performance. Finally, we provide some optimization to the cooperation between CPU and GPGPU, which is generally ignored in previous designs. Experimental results show that our parallelization and optimization achieve a throughput of 340.5 frames/s on a NVIDIA GTX295 GPGPU, which is 15X faster than the maximal optimized CPU version. Compared to CUDA SURF, a state-of-the-art parallelization of SURF on GPGPU, our system achieves a speedup by a factor of 2.3X.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123683229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
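The "2D memory" mentioned in the abstract can be illustrated with pitched device allocations, which pad each image row to an aligned width so row-wise accesses stay coalesced. The row-sum kernel below is only a stand-in workload, not part of the SURF pipeline.

#include <cuda_runtime.h>

// Sum each image row; the pitch (in bytes) returned by cudaMallocPitch keeps
// every row start aligned regardless of the image width.
__global__ void row_sum(const unsigned char* img, size_t pitch,
                        int width, int height, float* out) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height) return;
    const unsigned char* row = img + y * pitch;
    float s = 0.0f;
    for (int x = 0; x < width; ++x) s += row[x];
    out[y] = s;
}

int main() {
    const int width = 1920, height = 1080;
    unsigned char* d_img;
    size_t pitch;
    cudaMallocPitch((void**)&d_img, &pitch, width, height);  // aligned 2D allocation
    cudaMemset2D(d_img, pitch, 1, width, height);

    float* d_out;
    cudaMalloc(&d_out, height * sizeof(float));

    int block = 128, grid = (height + block - 1) / block;
    row_sum<<<grid, block>>>(d_img, pitch, width, height, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    cudaFree(d_out);
    return 0;
}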
Dynamic particle system for mesh extraction on the GPU
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159435
Mark Kim, Guoning Chen, C. Hansen
{"title":"Dynamic particle system for mesh extraction on the GPU","authors":"Mark Kim, Guoning Chen, C. Hansen","doi":"10.1145/2159430.2159435","DOIUrl":"https://doi.org/10.1145/2159430.2159435","url":null,"abstract":"Extracting isosurfaces represented as high quality meshes from three-dimensional scalar fields is needed for many important applications, particularly visualization and numerical simulations. One recent advance for extracting high quality meshes for isosurface computation is based on a dynamic particle system. Unfortunately, this state-of-the-art particle placement technique requires a significant amount of time to produce a satisfactory mesh. To address this issue, we study the parallelism property of the particle placement and make use of CUDA, a parallel programming technique on the GPU, to significantly improve the performance of particle placement. This paper describes the curvature dependent sampling method used to extract high quality meshes and describes its implementation using CUDA on the GPU.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"85 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122603353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
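A minimal sketch of the per-particle parallelism being exploited: one CUDA thread per particle, each taking an independent projection step toward the isosurface F(p) = iso along the field gradient. The analytic sphere field and the single Newton-style step are assumptions for illustration; the paper's curvature-dependent sampling and inter-particle energies are not reproduced here.

#include <vector>
#include <cuda_runtime.h>

struct Particle { float x, y, z; };

// Analytic scalar field used only for this sketch (a sphere of radius sqrt(iso)).
__device__ float field(float x, float y, float z) { return x * x + y * y + z * z; }

__global__ void project_to_isosurface(Particle* p, int n, float iso) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = p[i].x, y = p[i].y, z = p[i].z;
    float gx = 2.0f * x, gy = 2.0f * y, gz = 2.0f * z;      // gradient of the sphere field
    float g2 = gx * gx + gy * gy + gz * gz + 1e-12f;
    float step = (field(x, y, z) - iso) / g2;               // Newton-like step toward F(p) = iso
    p[i].x = x - step * gx;
    p[i].y = y - step * gy;
    p[i].z = z - step * gz;
}

int main() {
    const int n = 1024;
    std::vector<Particle> h(n);
    for (int i = 0; i < n; ++i) h[i] = {1.0f + i * 0.001f, 0.5f, 0.25f};

    Particle* d_p;
    cudaMalloc(&d_p, n * sizeof(Particle));
    cudaMemcpy(d_p, h.data(), n * sizeof(Particle), cudaMemcpyHostToDevice);

    // A few relaxation iterations; each is one bulk-parallel kernel launch.
    for (int it = 0; it < 10; ++it)
        project_to_isosurface<<<(n + 255) / 256, 256>>>(d_p, n, 1.0f);
    cudaDeviceSynchronize();

    cudaFree(d_p);
    return 0;
}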
Enabling task-level scheduling on heterogeneous platforms
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159440
Enqiang Sun, Dana Schaa, Richard Bagley, Norman Rubin, D. Kaeli
{"title":"Enabling task-level scheduling on heterogeneous platforms","authors":"Enqiang Sun, Dana Schaa, Richard Bagley, Norman Rubin, D. Kaeli","doi":"10.1145/2159430.2159440","DOIUrl":"https://doi.org/10.1145/2159430.2159440","url":null,"abstract":"OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. While OpenCL has been widely adopted, there still remains a lack of support for automatic task scheduling and data consistency when multiple devices appear in the system. To address this need, we have designed a task queueing extension for OpenCL that provides a high-level, unified execution model tightly coupled with a resource management facility. The main motivation for developing this extension is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible scheduling schemes. To demonstrate the value and utility of this extension, we have utilized an advanced OpenCL-based imaging toolkit called clSURF. Using our task queueing extension, we demonstrate the potential performance opportunities and limitations given current vendor implementations of OpenCL. Using a state-of-art implementation on a single GPU device as the baseline, our task queueing extension achieves a speedup up to 72.4%. Our extension also achieves scalable performance gains on multiple heterogeneous GPU devices. The performance trade-offs of using the host CPU as an accelerator are also evaluated.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115545101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 39
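The paper's task queueing extension is OpenCL-specific, but the scheduling pattern it automates can be sketched in CUDA: a host-side loop that hands independent tasks to one stream per device. The round-robin assignment and the stand-in kernel below are assumptions for illustration, not the clSURF-based extension itself.

#include <vector>
#include <cuda_runtime.h>

__global__ void task_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;   // stand-in workload
}

int main() {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices == 0) return 0;

    const int num_tasks = 16, n = 1 << 20;
    std::vector<cudaStream_t> streams(num_devices);
    std::vector<float*> buffers(num_tasks);

    for (int d = 0; d < num_devices; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&streams[d]);
    }

    // Round-robin dispatch: task t runs on device t % num_devices.
    for (int t = 0; t < num_tasks; ++t) {
        int d = t % num_devices;
        cudaSetDevice(d);
        cudaMalloc(&buffers[t], n * sizeof(float));
        cudaMemsetAsync(buffers[t], 0, n * sizeof(float), streams[d]);
        task_kernel<<<(n + 255) / 256, 256, 0, streams[d]>>>(buffers[t], n);
    }

    for (int d = 0; d < num_devices; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
    }
    for (int t = 0; t < num_tasks; ++t) {
        cudaSetDevice(t % num_devices);
        cudaFree(buffers[t]);
    }
    return 0;
}

A real scheduler would pick the first idle device rather than a fixed round-robin order and would track where each buffer lives, which is the data-consistency problem the extension is designed to manage automatically.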
JaBEE: framework for object-oriented Java bytecode compilation and execution on graphics processor units
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159439
Wojciech Zaremba, Yuan Lin, Vinod Grover
{"title":"JaBEE: framework for object-oriented Java bytecode compilation and execution on graphics processor units","authors":"Wojciech Zaremba, Yuan Lin, Vinod Grover","doi":"10.1145/2159430.2159439","DOIUrl":"https://doi.org/10.1145/2159430.2159439","url":null,"abstract":"There is an increasing interest from software developers in executing Java and .NET bytecode programs on General Purpose Graphics Processor Units (GPGPUs). Existing solutions have limited support for operations on objects and often require explicit handling of memory transfers between CPU and GPU. In this paper, we describe a Java Bytecode Execution Environment (JaBEE) which supports common object-oriented constructs such as dynamic dispatch, encapsulation and object creation on GPUs. This experimental environment facilitates GPU code compilation, execution and transparent memory management. We compare the performance of our approach with CPU-based and CUDA-C-based code executions of the same programs. We discuss challenges, limitations and opportunities of bytecode execution on GPGPUs.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131905318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 36
FLAT: a GPU programming framework to provide embedded MPI
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159433
T. Miyoshi, H. Irie, Keigo Shima, H. Honda, Masaaki Kondo, T. Yoshinaga
{"title":"FLAT: a GPU programming framework to provide embedded MPI","authors":"T. Miyoshi, H. Irie, Keigo Shima, H. Honda, Masaaki Kondo, T. Yoshinaga","doi":"10.1145/2159430.2159433","DOIUrl":"https://doi.org/10.1145/2159430.2159433","url":null,"abstract":"For leveraging multiple GPUs in a cluster system, it is necessary to assign application tasks to multiple GPUs and execute those tasks with appropriately using communication primitives to handle data transfer among GPUs. In current GPU programming models, communication primitives such as MPI functions cannot be used within GPU kernels. Instead, such functions should be used in the CPU code. Therefore, programmer must handle both GPU kernel and CPU code for data communications. This makes GPU programming and its optimization very difficult.\u0000 In this paper, we propose a programming framework named FLAT which enables programmers to use MPI functions within GPU kernels. Our framework automatically transforms MPI functions written in a GPU kernel into runtime routines executed on the CPU. The execution model and the implementation of FLAT are described, and the applicability of FLAT in terms of scalability and programmability is discussed. We also evaluate the performance of FLAT. The result shows that FLAT achieves good scalability for intended applications.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114144642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
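What FLAT automates can be sketched by hand: an MPI call that conceptually sits inside the kernel is realized by splitting the kernel at the communication point and letting the host stage the transfer. The two phase kernels and the one-element ring exchange below are illustrative assumptions, not FLAT's generated code.

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void compute_phase1(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;   // stand-in for work before the embedded MPI call
}

__global__ void compute_phase2(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;   // stand-in for work after the embedded MPI call
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    compute_phase1<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();

    // The communication the kernel "requested": staged through host memory
    // and exchanged with the neighbouring rank in a ring.
    float send, recv;
    cudaMemcpy(&send, d_buf, sizeof(float), cudaMemcpyDeviceToHost);
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    MPI_Sendrecv(&send, 1, MPI_FLOAT, next, 0,
                 &recv, 1, MPI_FLOAT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, &recv, sizeof(float), cudaMemcpyHostToDevice);

    compute_phase2<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Writing this split by hand for every communication point is exactly the burden the abstract describes; FLAT performs the transformation from in-kernel MPI calls to such host-side runtime routines automatically.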
Reducing off-chip memory traffic by selective cache management scheme in GPGPUs
GPGPU-5 Pub Date: 2012-03-03 DOI: 10.1145/2159430.2159443
Hyojin Choi, Jae-Woo Ahn, Wonyong Sung
{"title":"Reducing off-chip memory traffic by selective cache management scheme in GPGPUs","authors":"Hyojin Choi, Jae-Woo Ahn, Wonyong Sung","doi":"10.1145/2159430.2159443","DOIUrl":"https://doi.org/10.1145/2159430.2159443","url":null,"abstract":"The performance of General Purpose Graphics Processing Units (GPGPUs) is frequently limited by the off-chip memory bandwidth. To mitigate this bandwidth wall problem, recent GPUs are equipped with on-chip L1 and L2 caches. However, there has been little work for better utilizing on-chip shared caches in GPGPUs. In this paper, we propose two cache management schemes: write-buffering and read-bypassing. The write buffering technique tries to utilize the shared cache for inter-block communication, and thereby reduces the DRAM accesses as much as the capacity of the cache. The read-bypassing scheme prevents the shared cache from being polluted by streamed data that are consumed only within a thread-block. The proposed schemes can be selectively applied to global memory instructions using newly defined cache operators. We evaluate the effects of the proposed schemes for a few GPGPU applications by simulations. We have shown that the off-chip memory accesses can be successfully reduced by the proposed techniques. We also analyze the effectiveness of these methods when the throughput gap between cores and off-chip memory becomes wider.","PeriodicalId":232750,"journal":{"name":"GPGPU-5","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129625629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
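The scheme attaches cache behaviour to individual global memory instructions through cache operators; existing PTX already exposes a related mechanism, shown below as an inline-assembly streaming load (ld.global.cs) that marks data as evict-first. This is only an analogy on real hardware: the paper's write-buffering and read-bypassing operators are newly defined and evaluated in simulation, and are not reproduced here.

#include <cuda_runtime.h>

// Cache-streaming load: hints that the value will not be reused, so it should
// not displace other cached data.
__device__ float load_streaming(const float* addr) {
    float v;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

__global__ void scale(const float* in, float* out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * load_streaming(in + i);   // streamed data, read once
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}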