2011 Symposium on Application Accelerators in High-Performance Computing: Latest Publications

Evaluation of GPU Architectures Using Spiking Neural Networks
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.20
V. Pallipuram, M. Bhuiyan, M. C. Smith
Abstract: In recent years, General-Purpose Graphical Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia's Tesla C2050, codenamed Fermi, and AMD's Radeon 5870 are two devices positioned to meet the computationally demanding needs of supercomputing research groups across the globe. Though Nvidia GPUs powered by CUDA have been the frequent choice of performance-centric research groups, the introduction and growth of OpenCL has promoted AMD GP-GPUs as potential accelerator candidates that can challenge Nvidia's stronghold. These architectures not only offer a plethora of features for application developers to explore, but their radically different designs call for a detailed study that weighs their merits and evaluates their potential to accelerate complex scientific applications. In this paper, we present our performance analysis research comparing Nvidia's Fermi and AMD's Radeon 5870 using OpenCL as the common programming model. We have chosen four different neuron models for Spiking Neural Networks (SNNs), each with different communication and computation requirements, namely the Izhikevich, Wilson, Morris-Lecar (ML), and Hodgkin-Huxley (HH) models. We compare the runtime performance of the Fermi and Radeon GPUs with an implementation that exhausts all optimization techniques available with OpenCL. Several equivalent architectural parameters of the two GPUs are studied and correlated with the application performance. In addition to the comparative study effort, our implementations were able to achieve speed-ups of 857.3x and 658.51x on the Fermi and Radeon architectures respectively for the most compute-intensive HH model with a dense network containing 9.72 million neurons. The final outcome of this research is a detailed architectural comparison of the two GPU architectures with a common programming platform.
Citations: 12
QUonG: A GPU-based HPC System Dedicated to LQCD Computing
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.15
R. Ammendola, A. Biagioni, O. Frezza, F. Lo Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini
Abstract: QUonG is an INFN (Istituto Nazionale di Fisica Nucleare) initiative aimed at developing a high-performance computing system dedicated to Lattice QCD (LQCD) computations. QUonG is a massively parallel computing platform that leverages commodity multi-core processors coupled with latest-generation GPUs. Its network mesh exploits the characteristics of the LQCD algorithm for the design of a point-to-point, high-performance, low-latency 3D torus network to interconnect the computing nodes. The network is built upon the APEnet+ project: it consists of an FPGA-based PCI Express board exposing six fully bidirectional off-board links running at 34 Gbps each, and implementing the RDMA protocol and an experimental direct network-to-GPU interface, enabling a significant reduction in access latency for inter-node data transfers. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each capable of 60 TFlops of peak performance, at a cost of 5 k€/TFlops and with an estimated power consumption of 25 kW/rack. A first QUonG system prototype is expected to be delivered at the end of 2011.
Citations: 20
Design and Simulation of a Rectangular Meshotron Unit Prototype
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.21
C.L.S. Romeiro, Guilherme Campos, Arnaldo S. R. Oliveira
Abstract: A novel application-specific hardware (ASH) unit was designed to form the building block of the Meshotron -- a parallelisation network for three-dimensional (3D) digital waveguide-mesh (DWM) room acoustic models. The rectangular mesh topology was chosen. This ASH unit was tested using professional hardware simulation tools, assuming 32-bit integer data. Room impulse responses (RIR) were obtained for a set of small models under different test conditions, using both single-unit and multi-unit configurations. They proved exactly identical to those obtained using 3D DWM modelling software for the same models and test conditions, which validates the design.
Citations: 0
Iterative Refinement on FPGAs
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.19
Jun Kyu Lee, G. D. Peterson
Abstract: Achievable accuracy for mixed precision iterative refinement depends on the precisions supported by computing platforms. Even though the arithmetic unit precision can be flexible for programmable logic computing architectures (e.g. FPGAs), previous work rarely discusses the performance benefits due to enabling flexible achievable accuracy. Hence, we propose an iterative refinement approach on FPGAs which employs an arbitrary precision for the iterative refinement to obtain an arbitrary accuracy. We implement single processing elements for the refinement on the Xilinx XC5VLX110T and compare them to Xilinx XC6VSX475T for performance estimation. This paper shows that the performance is similar to the NVIDIA GTX480 when a user requires accuracies between single and double precision, but the implementation can also produce beyond double precision accuracy.
Citations: 8
Accelerating a Climate Physics Model with OpenCL
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.17
F. Zafar, D. Ghosh, Lawrence Sebald, Shujia Zhou
Abstract: Open Computing Language (OpenCL) is fast becoming the standard for heterogeneous parallel computing. It is designed to run on CPUs, GPUs, and other accelerator architectures. By implementing a real-world application, a solar radiation model component widely used in climate and weather models, we show that the OpenCL multi-threaded programming and execution model can dramatically increase performance even on CPU architectures. Our preliminary investigation indicates that low-level vector instructions and code representations in OpenCL contribute to dramatic performance improvement over the serial version when compared with the execution of the serial code compiled across various compilers on multiple platforms with auto-vectorization flags. However, the portability of OpenCL implementations needs to improve, even for CPU architectures.
Citations: 5
Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.11
Nicholas Moore, M. Leeser, L. King
Abstract: For some classes of problems, the NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well-known solution for the common case of linear image filtering -- small fixed templates of a known size applied to a much larger image -- the application considered here uses large arbitrarily-sized templates, up to 156-by-116 pixels, with small search spaces containing no more than 703 window positions per template. Our CUDA implementation approach employs template tiling and problem-specific kernel compilation to achieve speedups of up to 15x when compared to an optimized multi-threaded implementation running on a 3.33 GHz four-core Intel Nehalem processor. Tiling the template enables exploiting the parallelism within the computation and shared memory usage. At the same time, problem-specific kernel compilation allows greater levels of adaptability than would otherwise be possible.
Citations: 4
Quantum Chemical Many-Body Theory on Heterogeneous Nodes
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.28
A. Eugene DePrince III, J. Hammond
Abstract: The iterative solution of the coupled-cluster with single and double excitations (CCSD) equations is a very time-consuming component of the "gold standard" in quantum chemistry, the CCSD(T) method. In an effort to accelerate accurate quantum mechanical calculations, we explore two implementation strategies for the iterative solution of the CC equations on graphics processing units (GPUs). We consider a communication-avoiding algorithm for the spin-free coupled cluster doubles (CCD) equations followed by a low-storage algorithm for the spin-free CCSD equations. In the communication-avoiding algorithm, the entire iterative procedure for the CCD method is performed on the GPU, resulting in accelerations of a factor of 4-5 relative to the pure CPU algorithm. The low-storage CCSD algorithm requires that a minimum of $4o^2v^2+2ov$ elements be stored on the device, where $o$ and $v$ represent the number of orbitals occupied and unoccupied in the reference configuration, respectively. The algorithm masks the transfer time for copying large amounts of data to the GPU by overlapping GPU and CPU computations. The per-iteration costs of this hybrid GPU/CPU algorithm are up to 4.06 times less than those of the pure CPU algorithm and up to 10.63 times less than those of the CCSD implementation found in the Molpro electronic structure package. These results provide insight into how to organize communication and computation so as to maximize utilization of a GPU and a multicore CPU at the same time.
Citations: 4
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.29
Mayank Daga, Ashwin M. Aji, Wu-chun Feng
Abstract: The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of the CPU and GPU, e.g., AMD Fusion and Intel Knights Ferry, hold the promise of addressing the PCIe bottleneck. In this paper, we empirically characterize and analyze the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Depending on the benchmark, our results show that Fusion produces a 1.7 to 6.0-fold improvement in the data-transfer time, when compared to a discrete GPU. In turn, this improvement in data-transfer performance can significantly enhance application performance. For example, running a reduction benchmark on AMD Fusion with its mere 80 GPU cores improves performance by 3.5-fold over the discrete AMD Radeon HD 5870 GPU with its 1600 more powerful GPU cores.
Citations: 138
G-NetMon: A GPU-accelerated Network Performance Monitoring System
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.10
Wenji Wu, P. DeMar, D. Holmgren, Amitoj Singh
Abstract: At Fermilab, we have prototyped a GPU-accelerated network performance monitoring system, called G-NetMon, to support large-scale scientific collaborations. In this work, we explore new opportunities in network traffic monitoring and analysis with GPUs. Our system exploits the data parallelism that exists within network flow data to provide fast analysis of bulk data movement between Fermilab and collaboration sites. Experiments demonstrate that our G-NetMon can rapidly detect sub-optimal bulk data movements.
Citations: 2
Application of Graphics Processing Units (GPUs) to the Study of Non-linear Dynamics of the Exciton Bose-Einstein Condensate in a Semiconductor Quantum Well
2011 Symposium on Application Accelerators in High-Performance Computing Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.32
A. Gothandaraman, S. Sadatian, Michal Faryniarz, O. Berman, G. Kolmakov
Abstract: In this paper, we explore the use of Graphics Processing Units (GPUs) to solve numerically the nonlinear Gross-Pitaevskii equation with an external potential. Our implementation uses NVIDIA's Compute Unified Device Architecture (CUDA) programming paradigm and demonstrates a speedup of 190x on an NVIDIA Tesla C2050 (Fermi) GPU compared to an optimized software implementation on a single core of an Intel Xeon 5500-series processor. We apply the developed technique to the study of Bose-Einstein condensation (BEC) of excitons in semiconductor nanostructures. The technique is also applicable to the studies of atomic condensates, quantized vortices in quantum fluids, propagation of light pulses in optical waveguides, and ocean wave dynamics.
Citations: 4