2011 Symposium on Application Accelerators in High-Performance Computing: Latest Publications

A Study of the Performance of Multifluid PPM Gas Dynamics on CPUs and GPUs
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.27
Pei-Hung Lin, J. Jayaraj, P. Woodward
Abstract: The potential for GPUs and many-core CPUs to support high-performance computation in the area of computational fluid dynamics (CFD) is explored quantitatively through the example of the PPM gas dynamics code with PPB multifluid volume fraction advection. This code has already been implemented on the IBM Cell processor and run at full scale on the Los Alamos Roadrunner machine. That implementation involved a complete restructuring of the code that has been described in detail elsewhere. Here the lessons learned from that work are exploited to take advantage of today's latest generations of multi-core CPUs and many-core GPUs. The operations performed by this code are characterized in detail after first being decomposed into a series of individual code kernels to allow an implementation on GPUs. Careful implementations of this code for both CPUs and GPUs are then contrasted from a performance point of view. In addition, a single kernel that has many of the characteristics of the full application on CPUs has been built into a full, standalone, scalable parallel application. This single-kernel application shows the GPU at its best. In contrast, the full multifluid gas dynamics application brings into play computational requirements that highlight the essential differences in CPU and GPU designs today and the different programming strategies needed to achieve the best performance for applications of this type on the two devices. The single-kernel application code performs extremely well on both platforms. This application is not limited by main memory bandwidth on either device; instead it is limited only by the computational capability of each. In this case, the GPU has the advantage, because it has more computational cores. The full multifluid gas dynamics code is, however, of necessity memory-bandwidth limited on the GPU, while it is still computational-capability limited on the CPU. We believe that these codes provide a useful context for quantifying the costs and benefits of design decisions for these powerful new computing devices. Suggestions for improvements in both devices and codes based upon this work are offered in our conclusions.
Citations: 3
Real-Time Object Tracking System on FPGAs
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.22
S. Liu, Alexandros Papakonstantinou, Hongjun Wang, Deming Chen
Abstract: Object tracking is an important task in computer vision applications. One of the crucial challenges is the real-time speed requirement. In this paper we implement an object tracking system in reconfigurable hardware using an efficient parallel architecture. In our implementation, we adopt a background subtraction based algorithm. The designed object tracker exploits hardware parallelism to achieve high system speed. We also propose a dual object region search technique to further boost the performance of our system under complex tracking conditions. For our hardware implementation we use the Altera Stratix III EP3SL340H1152C2 FPGA device. We compare the proposed FPGA-based implementation with the software implementation running on a 2.2 GHz processor. The observed speedup can reach more than 100X for complex video inputs.
Citations: 46
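The background-subtraction step at the heart of such trackers is simple to state in software; the sketch below is a hypothetical Python illustration of the general technique (per-pixel thresholded difference against a background model, then a bounding-box search), not the paper's hardware pipeline:

```python
def background_subtract(frame, background, threshold=30):
    """Return a binary foreground mask: 1 where a pixel differs
    from the background model by more than the threshold."""
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def bounding_box(mask):
    """Smallest (top, left, bottom, right) box covering all foreground
    pixels; a crude stand-in for the object-region search step."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return (rows[0], cols[0], rows[-1], cols[-1])
```

In hardware, the per-pixel differences are independent, which is exactly the parallelism an FPGA pipeline exploits.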
Implications of Memory-Efficiency on Sparse Matrix-Vector Multiplication
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.24
Shweta Jain, Robin Pottathuparambil, R. Sass
Abstract: Sparse matrix-vector multiplication is an important operation for many iterative solvers. However, peak performance is limited by the fact that the commonly used algorithm alternates between compute-bound and memory-bound steps. This paper proposes a novel data structure and an FPGA-based hardware core that eliminates the limitations imposed by memory.
Citations: 5
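The memory-bound character of the commonly used algorithm is visible even in a scalar CSR (compressed sparse row) kernel; the sketch below is a generic Python illustration of that baseline, not the paper's proposed data structure:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A in CSR form.
    Each y[i] is a dot product over the nonzeros of row i; the
    indirect loads x[col_idx[k]] are what make this memory-bound."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

Each multiply-add consumes one matrix value, one column index, and one irregularly addressed vector element, so arithmetic intensity is low and memory traffic dominates.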
Porting Optimized GPU Kernels to a Multi-core CPU: Computational Quantum Chemistry Application Example
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.8
Dong Ye, Alexey Titov, V. Kindratenko, Ivan S. Ufimtsev, Todd J. Martinez
Abstract: We investigate techniques for optimizing a multi-core CPU code back-ported from a highly optimized GPU kernel. We show that common sub-expression elimination and loop unrolling optimization techniques improve code performance on the GPU, but not on the CPU. On the other hand, register reuse and loop merging are effective on the CPU and in combination they improve performance of the ported code by 16%.
Citations: 8
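The transformations named in the abstract are easy to state in miniature. The toy Python functions below (hypothetical, not taken from the paper's kernels) show common sub-expression elimination applied by hand, the change that helped on the GPU but not on the CPU:

```python
def kernel_naive(a, b, c):
    # the sub-expression (a + b) is written (and potentially
    # recomputed) three times
    return (a + b) * (a + b) + (a + b) * c

def kernel_cse(a, b, c):
    # common sub-expression hoisted into a temporary, i.e. a register
    t = a + b
    return t * t + t * c
```

Both functions compute the same value; whether the hoisted form is faster depends on how many registers the target has to spare, which is one reason the same transformation can pay off on one architecture and not another.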
Experience Applying Fortran GPU Compilers to Numerical Weather Prediction
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.9
T. Henderson, J. Middlecoff, J. Rosinski, M. Govett, P. Madden
Abstract: Graphics Processing Units (GPUs) have enabled significant improvements in computational performance compared to traditional CPUs in several application domains. Until recently, GPUs have been programmed using C/C++ based methods such as CUDA (NVIDIA) and OpenCL (NVIDIA and AMD). Using these approaches, Fortran Numerical Weather Prediction (NWP) codes would have to be completely re-written to take full advantage of GPU performance gains. Emerging commercial Fortran compilers allow NWP codes to take advantage of GPU processing power with much less software development effort. The Non-hydrostatic Icosahedral Model (NIM) is a prototype dynamical core for global NWP. We use NIM to examine Fortran directive-based GPU compilers, evaluating code porting effort and computational performance.
Citations: 24
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.18
Mitchel D. Horton, S. Tomov, J. Dongarra
Abstract: Three of the top four supercomputers in the November 2010 TOP500 list of the world's most powerful supercomputers use NVIDIA GPUs to accelerate computations. Ninety-five systems on the list use processors with six or more cores, three hundred sixty-five use quad-core processors, and thirty-seven use dual-core processors. The large-scale enabling of hybrid graphics processing unit (GPU)-based multicore platforms for computational science, by developing fundamental numerical libraries for them (in particular, libraries in the area of dense linear algebra), has been underway for some time. We present a class of algorithms based largely on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithms extend what is currently available in the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library for performing Cholesky, QR, and LU factorizations using a single core or socket and a single GPU. The extensions occur in two areas. First, panels factored on the CPU using LAPACK are instead done in parallel using a highly optimized, dynamically scheduled asynchronous algorithm on some number of CPU cores. Second, the remaining CPU cores are used to update the rightmost panels of the matrix in parallel.
Citations: 39
A First Analysis of a Dynamic Memory Allocation Controller (DMAC) Core
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.23
Y. Rajasekhar, R. Sass
Abstract: Networking performance continues to grow, but processor clock frequencies have not. Likewise, the latency to primary memory is not expected to improve dramatically either. This is leading computer architects to reconsider the networking subsystem and the roles and responsibilities of hardware and the operating system. This paper presents the first component of a new networking subsystem in which the hardware is responsible for buffering messages, when necessary, without interrupting or involving the operating system. The design is presented and its functionality is demonstrated. The core on an FPGA is exercised with a synthetic stream of messages, and the results show that the analytical performance model and measured performance agree.
Citations: 2
Python for Development of OpenMP and CUDA Kernels for Multidimensional Data
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.26
B. Vacaliuc, D. Patlolla, E. D'Azevedo, G. Davidson, John K. Munro Jr, T. Evans, W. Joubert, Z. Bell
Abstract: Design of data structures for high performance computing (HPC) is one of the principal challenges facing researchers looking to utilize heterogeneous computing machinery. Heterogeneous systems derive cost, power, and speed efficiency by being composed of the appropriate hardware for the task. Yet each type of processor requires a specific organization of the application state in order to achieve peak performance. Discovering this and refactoring the code can be a challenging and time-consuming task for the researcher, as the data structures and the computational model must be co-designed. We present a methodology that uses Python as the environment in which to explore tradeoffs in both the data structure design and the code executing on the computation accelerator. Our method enables multi-dimensional arrays to be used effectively in any target environment. We have chosen to focus on OpenMP and CUDA environments, thus exploring the development of optimized kernels for the two most common classes of computing hardware available today: multi-core CPU and GPU. Python's large palette of file and network access routines, its associative indexing syntax, and its support for common HPC environments make it relevant for diverse hardware ranging from laptops through computing clusters to the highest performance supercomputers. Our work enables researchers to accelerate the development of their codes on the computing hardware of their choice.
Citations: 1
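One concrete data-structure decision such an exploration surfaces is the memory layout of a multidimensional array. A hypothetical Python sketch of row-major (C-order) index linearization, the flat-buffer layout an OpenMP or CUDA kernel would typically assume for multidimensional data:

```python
def row_major_offset(index, shape):
    """Linear offset of a multidimensional index in a C-ordered
    (row-major) flat buffer: the last axis varies fastest.
    For shape (d0, d1, d2), offset = (i0 * d1 + i1) * d2 + i2."""
    offset = 0
    for i, n in zip(index, shape):
        assert 0 <= i < n, "index out of bounds"
        offset = offset * n + i
    return offset
```

Choosing row-major versus column-major order (or a blocked layout) changes which loop order gives contiguous, coalesced accesses, which is exactly the kind of tradeoff that is cheap to prototype in Python before committing to a kernel.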
Non-serial Polyadic Dynamic Programming on a Data-Parallel Many-core Architecture
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.25
M. Moazeni, M. Sarrafzadeh, A. Bui
Abstract: Dynamic programming (DP) is a method for efficiently solving a broad range of search and optimization problems. As a result, techniques for managing large-scale DP problems are often critical to the performance of many applications. DP algorithms are often hard to parallelize. In this paper, we address the challenge of exploiting fine-grain parallelism in a family of DP algorithms known as non-serial polyadic. We use an abstract formulation of non-serial polyadic DP, derived from RNA secondary structure prediction and matrix parenthesization, which are well-known and important problems from this family. We present a load balancing algorithm that achieves the best overall performance with this type of workload on many-core architectures. A divide-and-conquer approach previously used on multi-core architectures is compared against an iterative version. To evaluate these approaches, the algorithm was implemented on three NVIDIA GPUs using CUDA. We achieved up to 10 GFLOP/s performance and up to 228x speedup over the single-threaded CPU implementation. Moreover, the iterative approach results in up to 3.92x speedup over the divide-and-conquer approach.
Citations: 1
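Matrix parenthesization, one of the two problems the authors abstract from, illustrates the non-serial polyadic recurrence: each cell combines pairs of previously computed cells, and the cells on each diagonal are mutually independent, which is the wavefront parallelism a GPU schedule exploits. A minimal sequential Python version of the standard recurrence (an illustration of the problem family, not the authors' CUDA code):

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to evaluate a chain of matrices,
    where matrix i has shape dims[i] x dims[i+1].  The recurrence
      m[i][j] = min over k of m[i][k] + m[k+1][j]
                + dims[i] * dims[k+1] * dims[j+1]
    is non-serial polyadic: each cell depends on a pair of cells."""
    n = len(dims) - 1            # number of matrices in the chain
    m = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):       # one diagonal per chain length
        for i in range(n - length + 1):  # cells on this diagonal are independent
            j = i + length - 1
            m[i][j] = min(m[i][k] + m[k + 1][j]
                          + dims[i] * dims[k + 1] * dims[j + 1]
                          for k in range(i, j))
    return m[0][n - 1]
```

In the iterative GPU version the outer loop over diagonals stays sequential while all cells of a diagonal are computed in parallel; load imbalance arises because diagonals shrink as the computation proceeds.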
GPU Performance Comparison for Accelerated Radar Data Processing
2011 Symposium on Application Accelerators in High-Performance Computing. Pub Date: 2011-07-19. DOI: 10.1109/SAAHPC.2011.14
C. Fallen, B.V.C. Bellamy, G. Newby, B. Watkins
Abstract: Radar is a data-intensive measurement technique often requiring significant processing to make full use of the received signal. However, computing capacity is limited at remote or mobile radar installations, thereby limiting the radar data products available for real-time decisions. We used graphics processing units (GPUs) to accelerate processing of high-resolution phase-coded radar data from the Modular UHF Ionosphere Radar (MUIR) at the High-frequency Active Auroral Research Program (HAARP) facility in Gakona, Alaska. Previously, this data could not be processed on-site in sufficient time to be useful for decisions made during active experiment campaigns, nor could the data be uploaded for off-site processing to high-performance computing (HPC) resources at the Arctic Region Supercomputing Center (ARSC) in Fairbanks. In this paper, we present a radar data-processing performance comparison of a workstation equipped with dual NVIDIA GeForce GTX 480 GPU accelerator cards and a node from ARSC's PACMAN cluster equipped with dual NVIDIA Tesla M2050 cards. Both platforms meet performance requirements, are relatively inexpensive, and could operate effectively at remote observatories such as HAARP.
Citations: 10