2011 Symposium on Application Accelerators in High-Performance Computing: Latest Papers, Page 2

Evaluation of GPU Architectures Using Spiking Neural Networks

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.20

V. Pallipuram, M. Bhuiyan, M. C. Smith

Abstract: In recent years, General-Purpose Graphical Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia's Tesla C2050, codenamed Fermi, and AMD's Radeon 5870 are two devices positioned to meet the computationally demanding needs of supercomputing research groups across the globe. Though Nvidia GPUs powered by CUDA have been the frequent choice of performance-centric research groups, the introduction and growth of OpenCL has promoted AMD GP-GPUs as potential accelerator candidates that can challenge Nvidia's stronghold. These devices not only offer a plethora of features for application developers to explore, but their radically different architectures call for a detailed study that weighs their merits and evaluates their potential to accelerate complex scientific applications. In this paper, we present our performance analysis research comparing Nvidia's Fermi and AMD's Radeon 5870 using OpenCL as the common programming model. We have chosen four different neuron models for Spiking Neural Networks (SNNs), each with different communication and computation requirements, namely the Izhikevich, Wilson, Morris-Lecar (ML), and Hodgkin-Huxley (HH) models. We compare the runtime performance of the Fermi and Radeon GPUs with an implementation that exhausts all optimization techniques available with OpenCL. Several equivalent architectural parameters of the two GPUs are studied and correlated with the application performance. In addition to the comparative study effort, our implementations were able to achieve speed-ups of 857.3x and 658.51x on the Fermi and Radeon architectures, respectively, for the most compute-intensive HH model with a dense network containing 9.72 million neurons. The final outcome of this research is a detailed architectural comparison of the two GPU architectures with a common programming platform.
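For readers unfamiliar with the neuron models named in the abstract, the following is a minimal sketch of the simplest of the four, the Izhikevich model, using its standard published equations with regular-spiking parameters. It is a plain-Python illustration and not the authors' OpenCL implementation; the function names, the forward-Euler integration, the 0.5 ms time step, and the constant input current are all assumptions made for this example.

```python
def izhikevich_step(v, u, current, a=0.02, b=0.2, c=-65.0, d=8.0, dt=0.5):
    """Advance one neuron by dt milliseconds using forward Euler.

    v is the membrane potential (mV), u the recovery variable.
    Parameter defaults are the commonly cited regular-spiking values.
    """
    fired = v >= 30.0              # the model's spike cutoff
    if fired:
        v, u = c, u + d            # reset both variables after a spike
    dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
    du = a * (b * v - u)
    return v + dt * dv, u + dt * du, fired


def count_spikes(current, duration_ms=1000.0, dt=0.5):
    """Count spikes of a single neuron driven by a constant current (hypothetical helper)."""
    v, u = -65.0, 0.2 * -65.0      # start near rest with u = b * v
    spikes = 0
    for _ in range(int(duration_ms / dt)):
        v, u, fired = izhikevich_step(v, u, current, dt=dt)
        spikes += fired
    return spikes
```

Each neuron update is a handful of multiply-adds, which is why this model sits at the low-computation end of the four; a GPU version would typically map one neuron to one work-item, though the abstract does not say how the authors organized their kernels.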

Citations: 12

QUonG: A GPU-based HPC System Dedicated to LQCD Computing

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.15

R. Ammendola, A. Biagioni, O. Frezza, F. Lo Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini

Abstract: QUonG is an INFN (Istituto Nazionale di Fisica Nucleare) initiative to develop a high-performance computing system dedicated to Lattice QCD (LQCD) computations. QUonG is a massively parallel computing platform that leverages commodity multi-core processors coupled with latest-generation GPUs. Its network mesh exploits the characteristics of LQCD algorithms in the design of a point-to-point, high-performance, low-latency 3D torus network that interconnects the computing nodes. The network is built upon the APEnet+ project: it consists of an FPGA-based PCI Express board exposing six fully bidirectional off-board links running at 34 Gbps each, and implementing the RDMA protocol and an experimental direct network-to-GPU interface, enabling a significant reduction in access latency for inter-node data transfers. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each capable of 60 TFlops of peak performance, at a cost of 5 k€/TFlops and with an estimated power consumption of 25 kW per rack. A first QUonG system prototype is expected to be delivered at the end of 2011.
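The per-rack figures quoted in the abstract can be combined into a few derived quantities. The sketch below is only back-of-the-envelope arithmetic on those quoted numbers; the function names are hypothetical, the cost unit is read as thousands of euros, and the aggregate link figure simply sums the nominal per-link rates without accounting for protocol overhead.

```python
def rack_cost_keur(peak_tflops=60.0, keur_per_tflops=5.0):
    """Approximate cost of one rack in thousands of euros."""
    return peak_tflops * keur_per_tflops


def rack_gflops_per_watt(peak_tflops=60.0, power_kw=25.0):
    """Peak energy efficiency of one rack in GFlops per watt."""
    return (peak_tflops * 1000.0) / (power_kw * 1000.0)


def board_aggregate_gbps(links=6, gbps_per_link=34.0):
    """Sum of the nominal rates of all off-board links on one board."""
    return links * gbps_per_link
```

With the default values these give about 300 k€ per rack, 2.4 GFlops/W at peak, and 204 Gbps of nominal off-board bandwidth per board.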

Citations: 20

Design and Simulation of a Rectangular Meshotron Unit Prototype

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.21

C.L.S. Romeiro, Guilherme Campos, Arnaldo S. R. Oliveira

Citations: 0

Iterative Refinement on FPGAs

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.19

Jun Kyu Lee, G. D. Peterson

Citations: 8

Accelerating a Climate Physics Model with OpenCL

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.17

F. Zafar, D. Ghosh, Lawrence Sebald, Shujia Zhou

Citations: 5

Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.11

Nicholas Moore, M. Leeser, L. King

Citations: 4

Quantum Chemical Many-Body Theory on Heterogeneous Nodes

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.28

A. Eugene DePrince III, J. Hammond

Abstract: The iterative solution of the coupled-cluster with single and double excitations (CCSD) equations is a very time-consuming component of the "gold standard" in quantum chemistry, the CCSD(T) method. In an effort to accelerate accurate quantum mechanical calculations, we explore two implementation strategies for the iterative solution of the CC equations on graphics processing units (GPUs). We consider a communication-avoiding algorithm for the spin-free coupled cluster doubles (CCD) equations followed by a low-storage algorithm for the spin-free CCSD equations. In the communication-avoiding algorithm, the entire iterative procedure for the CCD method is performed on the GPU, resulting in accelerations of a factor of 4-5 relative to the pure CPU algorithm. The low-storage CCSD algorithm requires that a minimum of $4o^2v^2+2ov$ elements be stored on the device, where $o$ and $v$ represent the number of orbitals occupied and unoccupied in the reference configuration, respectively. The algorithm masks the transfer time for copying large amounts of data to the GPU by overlapping GPU and CPU computations. The per-iteration costs of this hybrid GPU/CPU algorithm are up to 4.06 times less than those of the pure CPU algorithm and up to 10.63 times less than those of the CCSD implementation found in the Molpro electronic structure package. These results provide insight into how to organize communication and computation so as to maximize utilization of a GPU and a multicore CPU at the same time.
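To give a feel for the storage bound $4o^2v^2+2ov$ stated in the abstract, the sketch below evaluates it for a given number of occupied and unoccupied orbitals. The function names are hypothetical, and the 8 bytes per element (double precision) is an assumption, since the abstract gives the bound in elements only.

```python
def ccsd_min_device_elements(o, v):
    """Minimum element count on the device: 4*o^2*v^2 + 2*o*v."""
    return 4 * o**2 * v**2 + 2 * o * v


def ccsd_min_device_bytes(o, v, bytes_per_element=8):
    """Same bound in bytes, assuming double-precision elements by default."""
    return ccsd_min_device_elements(o, v) * bytes_per_element
```

For example, with 10 occupied and 100 unoccupied orbitals the bound is 4,002,000 elements, or about 32 MB under the double-precision assumption; because the leading term grows with the square of both orbital counts, doubling both multiplies the footprint by roughly sixteen.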

Citations: 4

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.29

Mayank Daga, Ashwin M. Aji, Wu-chun Feng

Abstract: The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of the CPU and GPU, e.g., AMD Fusion and Intel Knights Ferry, hold the promise of addressing the PCIe bottleneck. In this paper, we empirically characterize and analyze the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Depending on the benchmark, our results show that Fusion produces a 1.7 to 6.0-fold improvement in the data-transfer time, when compared to a discrete GPU. In turn, this improvement in data-transfer performance can significantly enhance application performance. For example, running a reduction benchmark on AMD Fusion with its mere 80 GPU cores improves performance 3.5-fold over the discrete AMD Radeon HD 5870 GPU with its 1600 more powerful GPU cores.

Citations: 138

G-NetMon: A GPU-accelerated Network Performance Monitoring System

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.10

Wenji Wu, P. DeMar, D. Holmgren, Amitoj Singh

Citations: 2

Application of Graphics Processing Units (GPUs) to the Study of Non-linear Dynamics of the Exciton Bose-Einstein Condensate in a Semiconductor Quantum Well

2011 Symposium on Application Accelerators in High-Performance Computing, Pub Date: 2011-07-19, DOI: 10.1109/SAAHPC.2011.32

A. Gothandaraman, S. Sadatian, Michal Faryniarz, O. Berman, G. Kolmakov

Citations: 4