Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis: Latest Publications

Triangular matrix inversion on Graphics Processing Unit

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654069

F. Ries, T. DeMarco, Matteo Zivieri, R. Guerrieri

Citations: 33

A 32x32x32, spatially distributed 3D FFT in four microseconds on Anton

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654083

C. Young, Joseph A. Bank, R. Dror, J. P. Grossman, J. Salmon, D. Shaw

Abstract: Anton, a massively parallel special-purpose machine for molecular dynamics simulations, performs a 32 × 32 × 32 FFT in 3.7 microseconds and a 64 × 64 × 64 FFT in 13.3 microseconds on a configuration with 512 nodes, an order of magnitude faster than all other FFT implementations of which we are aware. Achieving this FFT performance requires a coordinated combination of computation and communication techniques that leverage Anton's underlying hardware mechanisms. Most significantly, Anton's communication subsystem provides over 300 gigabits per second of bandwidth per node, message latency in the hundreds of nanoseconds, and support for word-level writes and single-ended communication. In addition, Anton's general-purpose computation system incorporates primitives that support the efficient parallelization of small 1D FFTs. Although Anton was designed specifically for molecular dynamics simulations, a number of the hardware primitives and software implementation techniques described in this paper may also be applicable to the acceleration of FFTs on general-purpose high-performance machines.
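The abstract's mention of small 1D FFTs reflects a standard property of multidimensional transforms: a 3D transform can be computed as independent 1D transforms along each axis in turn. The sketch below illustrates only that decomposition in plain Python. It is not Anton's implementation: it uses a naive O(n^2) DFT instead of a fast algorithm, runs on one process with no data distribution, and all function names are hypothetical.

```python
import cmath
from itertools import product


def dft_1d(values):
    """Naive O(n^2) 1D discrete Fourier transform of a list of numbers."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_3d_by_axes(grid):
    """Transform an n x n x n nested list with 1D passes along z, then y, then x."""
    n = len(grid)
    out = [[[complex(grid[x][y][z]) for z in range(n)] for y in range(n)] for x in range(n)]
    # Pass along z: every (x, y) pencil is independent of the others.
    for x, y in product(range(n), repeat=2):
        out[x][y] = dft_1d(out[x][y])
    # Pass along y: gather each (x, z) pencil, transform it, scatter it back.
    for x, z in product(range(n), repeat=2):
        pencil = dft_1d([out[x][y][z] for y in range(n)])
        for y in range(n):
            out[x][y][z] = pencil[y]
    # Pass along x.
    for y, z in product(range(n), repeat=2):
        pencil = dft_1d([out[x][y][z] for x in range(n)])
        for x in range(n):
            out[x][y][z] = pencil[x]
    return out


def dft_3d_direct(grid):
    """Reference: evaluate the 3D DFT definition directly (O(n^6), tiny grids only)."""
    n = len(grid)
    result = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for kx, ky, kz in product(range(n), repeat=3):
        total = 0j
        for x, y, z in product(range(n), repeat=3):
            phase = -2j * cmath.pi * (kx * x + ky * y + kz * z) / n
            total += grid[x][y][z] * cmath.exp(phase)
        result[kx][ky][kz] = total
    return result
```

Within each pass the pencils share no data, so they can be transformed concurrently; in a distributed setting the gather and scatter steps between passes become communication, which is where bandwidth and latency of the kind the abstract describes would matter.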

Citations: 39

A massively parallel adaptive fast-multipole method on heterogeneous architectures

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654118

I. Lashuk, Aparna Chandramowlishwaran, Harper Langston, Tuan A. Nguyen, R. Sampath, A. Shringarpure, R. Vuduc, Lexing Ying, D. Zorin, G. Biros

Abstract: We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over a comparable CPU-only implementation. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct- and approximate-interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. To do so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.

Citations: 181

Predicting the execution time of grid workflow applications through local learning

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654093

F. Nadeem, T. Fahringer

Citations: 37

Space-efficient time-series call-path profiling of parallel applications

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654097

Z. Szebenyi, F. Wolf, B. Wylie

Citations: 25

A design methodology for domain-optimized power-efficient supercomputing

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654072

M. Mohiyuddin, M. Murphy, L. Oliker, J. Shalf, J. Wawrzynek, Samuel Williams

Abstract: As power has become the pre-eminent design constraint for future HPC systems, computational efficiency is being emphasized over simply peak performance. Recently, static benchmark codes have been used to find a power efficient architecture. Unfortunately, because compilers generate sub-optimal code, benchmark performance can be a poor indicator of the performance potential of architecture design points. Therefore, we present hardware/software co-tuning as a novel approach for system design, in which traditional architecture space exploration is tightly coupled with software auto-tuning for delivering substantial improvements in area and power efficiency. We demonstrate the proposed methodology by exploring the parameter space of a Tensilica-based multi-processor running three of the most heavily used kernels in scientific computing, each with widely varying micro-architectural requirements: sparse matrix vector multiplication, stencil-based computations, and general matrix-matrix multiplication. Results demonstrate that co-tuning significantly improves hardware area and energy efficiency, a key driver for the next generation of HPC system design.

Citations: 19

Autotuning multigrid with PetaBricks

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654065

Cy P. Chan, Jason Ansel, Y. Wong, Saman P. Amarasinghe, A. Edelman

Abstract: Algorithmic choice is essential in any problem domain to realizing optimal computational performance. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels to take advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start all over when migrating from one machine to another. We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom of independently tuning both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms that have different accuracies and performances. To search this space efficiently, our autotuner utilizes a novel dynamic programming method to build efficient tuned algorithms from the bottom up. The results are customized multigrid algorithms that invest targeted computational power to yield the accuracy required by the user. The techniques we describe allow the user to automatically generate tuned multigrid cycles of different shapes targeted to the user's specific combination of problem, hardware, and accuracy requirements. These cycle shapes dictate the order in which grid coarsening and grid refinement are interleaved with both iterative methods, such as Jacobi or Successive Over-Relaxation, as well as direct methods, which tend to have superior performance for small problem sizes.

The need to make choices between all of these methods brings the issue of variable accuracy to the forefront. Not only must the autotuning framework compare different possible multigrid cycle shapes against each other, but it also needs the ability to compare tuned cycles against both direct and (non-multigrid) iterative methods. We address this problem by using an accuracy metric for measuring the effectiveness of tuned cycle shapes and making comparisons over all algorithmic types based on this common yardstick. In our results, we find that the flexibility to trade performance versus accuracy at all levels of recursive computation enables us to achieve excellent performance on a variety of platforms compared to algorithmically static implementations of multigrid. Our implementation uses PetaBricks, an implicitly parallel programming language where algorithmic choices are exposed in the language. The PetaBricks compiler uses these choices to analyze, autotune, and verify the PetaBricks program. These language features, most notably the autotuner, were key in enabling our implementation to be clear, correct, and fast.
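To make the tunable pieces concrete, here is a minimal sketch of a fixed multigrid V-cycle for the 1D Poisson problem -u'' = f with zero boundary values. It is written in Python, not PetaBricks, and is not the authors' code: the function names, the weighted-Jacobi smoother, the full-weighting restriction, and the fixed sweep counts are all assumptions chosen for illustration. The hard-coded choices (number of smoothing sweeps, which smoother, and the grid size at which to switch to a direct solve) are the kind of per-level decisions the abstract says the autotuner makes.

```python
def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps for (2u[i] - u[i-1] - u[i+1]) / h^2 = f[i], u[0] = u[-1] = 0."""
    for _ in range(sweeps):
        old = u[:]
        for i in range(1, len(u) - 1):
            update = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
            u[i] = (1.0 - omega) * old[i] + omega * update
    return u


def residual(u, f, h):
    """Residual f - A u at interior points; boundary entries stay zero."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def v_cycle(u, f, h, pre=2, post=2):
    """One V-cycle, in place. len(u) - 1 (the interval count) must be a power of two."""
    n = len(u) - 1
    if n == 2:
        # Coarsest grid: a single interior unknown, solved directly.
        u[1] = 0.5 * h * h * f[1]
        return u
    jacobi(u, f, h, pre)  # pre-smoothing (iteration count is a tunable choice)
    r = residual(u, f, h)
    half = n // 2
    # Full-weighting restriction of the residual to the coarser grid.
    coarse_r = [0.0] * (half + 1)
    for i in range(1, half):
        coarse_r[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    # Recursively solve the error equation on the coarser grid.
    e = v_cycle([0.0] * (half + 1), coarse_r, 2.0 * h, pre, post)
    # Linear interpolation of the coarse correction onto the fine grid.
    for i in range(1, half):
        u[2 * i] += e[i]
    for i in range(half):
        u[2 * i + 1] += 0.5 * (e[i] + e[i + 1])
    jacobi(u, f, h, post)  # post-smoothing
    return u
```

Calling `v_cycle` repeatedly on the same `u` reduces the residual further each time, so the number of cycles is one more knob that trades run time against accuracy.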

Citations: 35

Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654108

Kamesh Madduri, Samuel Williams, S. Ethier, L. Oliker, J. Shalf, E. Strohmaier, K. Yelick

Abstract: We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application to study turbulent transport in magnetic-confinement fusion devices. Particle-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charges to a 3D toroidal mesh, and multiple particles may contribute to the charge at a grid point. We design new parallel algorithms for the GTC charge deposition kernel, and analyze their performance on three leading multicore platforms. We implement thirteen different variants for this kernel and identify the best-performing ones given typical PIC parameters such as the grid size, number of particles per cell, and the GTC-specific particle Larmor radius variation. We find that our best strategies can be 2x faster than the reference optimized MPI implementation, and our analysis provides insight into desirable architectural features for high-performance PIC simulation codes.
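Because several particles can add charge to the same grid point, a parallel deposition has to avoid conflicting writes. The sketch below shows the scatter-add pattern on a deliberately simplified case: a 1D periodic grid with linear weighting, rather than GTC's 3D toroidal mesh. It then shows one common way to sidestep write conflicts, which is to give each worker a private grid copy and sum the copies afterwards. This is an illustration under those assumptions, not one of the paper's variants; the function names are hypothetical and the "workers" are simulated sequentially.

```python
import math


def deposit(positions, charges, num_cells, cell_width):
    """Scatter each particle's charge onto its two nearest grid points (periodic grid)."""
    grid = [0.0] * num_cells
    for x, q in zip(positions, charges):
        scaled = x / cell_width
        cell = math.floor(scaled)
        frac = scaled - cell  # fractional offset of the particle inside its cell
        grid[cell % num_cells] += q * (1.0 - frac)
        grid[(cell + 1) % num_cells] += q * frac
    return grid


def deposit_with_private_grids(positions, charges, num_cells, cell_width, workers):
    """Each worker deposits its share of particles into its own grid; copies are then summed."""
    partial = [
        deposit(positions[w::workers], charges[w::workers], num_cells, cell_width)
        for w in range(workers)
    ]
    # Reduction step: add the private grids point by point.
    return [sum(column) for column in zip(*partial)]
```

The private-copy approach needs no synchronization during deposition, but its memory use grows with the number of workers times the grid size, which is the kind of cost a memory-efficient strategy would aim to reduce.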

Citations: 29

SCAMPI: a scalable CAM-based algorithm for multiple pattern inspection

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654106

F. Petrini, Virat Agarwal, D. Pasetto

Abstract: String matching is one of the most compute-intensive steps in a network intrusion detection system. The growing network rates, rapidly approaching 10 Gbits/sec, and the large number of signatures that need to be scanned concurrently pose very demanding challenges to algorithmic design and practical implementation. In this paper we present SCAMPI, a ground-breaking string searching algorithm that is fast, space-efficient, scalable and resilient to attacks. SCAMPI is designed with a memory-centric model of complexity in mind, to minimize memory traffic and enhance data reuse with a careful compile-time data layout. The experimental evaluation executed on two families of multicore processors, Cell B.E. and Intel Xeon E5472, shows that it is possible to obtain a processing rate of more than 2 Gbits/sec per core with very large dictionaries and heavy hitting rates. In the largest tested configuration, SCAMPI reaches 16 Gbits/sec on 8 Xeon cores, reaching, and in some cases exceeding, the performance of special-purpose processors and FPGAs. Using SCAMPI we have been able to scan an input stream using a dictionary of 3.5 million keywords, more than an order of magnitude larger than any published result in the literature and in commercial prototypes, at a rate of more than 1.2 Gbits/sec per processing core.

Citations: 4

A configurable algorithm for parallel image-compositing applications

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI: 10.1145/1654059.1654064

T. Peterka, David Goodell, R. Ross, Han-Wei Shen, R. Thakur

Citations: 79