Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis最新文献

筛选
英文 中文
Triangular matrix inversion on Graphics Processing Unit 三角矩阵反演图形处理单元
F. Ries, T. DeMarco, Matteo Zivieri, R. Guerrieri
{"title":"Triangular matrix inversion on Graphics Processing Unit","authors":"F. Ries, T. DeMarco, Matteo Zivieri, R. Guerrieri","doi":"10.1145/1654059.1654069","DOIUrl":"https://doi.org/10.1145/1654059.1654069","url":null,"abstract":"Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114982525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 33
A 32x32x32, spatially distributed 3D FFT in four microseconds on Anton 一个32x32x32,空间分布的3D FFT在4微秒安东
C. Young, Joseph A. Bank, R. Dror, J. P. Grossman, J. Salmon, D. Shaw
{"title":"A 32x32x32, spatially distributed 3D FFT in four microseconds on Anton","authors":"C. Young, Joseph A. Bank, R. Dror, J. P. Grossman, J. Salmon, D. Shaw","doi":"10.1145/1654059.1654083","DOIUrl":"https://doi.org/10.1145/1654059.1654083","url":null,"abstract":"Anton, a massively parallel special-purpose machine for molecular dynamics simulations, performs a 32 × 32 × 32 FFT in 3.7 microseconds and a 64 × 64 × 64 FFT in 13.3 microseconds on a configuration with 512 nodes-an order of magnitude faster than all other FFT implementations of which we are aware. Achieving this FFT performance requires a coordinated combination of computation and communication techniques that leverage Anton's underlying hardware mechanisms. Most significantly, Anton's communication subsystem provides over 300 gigabits per second of bandwidth per node, message latency in the hundreds of nanoseconds, and support for word-level writes and single-ended communication. In addition, Anton's general-purpose computation system incorporates primitives that support the efficient parallelization of small 1D FFTs. Although Anton was designed specifically for molecular dynamics simulations, a number of the hardware primitives and software implementation techniques described in this paper may also be applicable to the acceleration of FFTs on general-purpose high-performance machines.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115160172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 39
A massively parallel adaptive fast-multipole method on heterogeneous architectures 异构体系结构的大规模并行自适应快速多极方法
I. Lashuk, Aparna Chandramowlishwaran, Harper Langston, Tuan A. Nguyen, R. Sampath, A. Shringarpure, R. Vuduc, Lexing Ying, D. Zorin, G. Biros
{"title":"A massively parallel adaptive fast-multipole method on heterogeneous architectures","authors":"I. Lashuk, Aparna Chandramowlishwaran, Harper Langston, Tuan A. Nguyen, R. Sampath, A. Shringarpure, R. Vuduc, Lexing Ying, D. Zorin, G. Biros","doi":"10.1145/1654059.1654118","DOIUrl":"https://doi.org/10.1145/1654059.1654118","url":null,"abstract":"We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over a comparable CPU-only based implementations. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct- and approximate-interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. To do so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"56 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126140500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 181
Predicting the execution time of grid workflow applications through local learning 通过局部学习预测网格工作流应用程序的执行时间
F. Nadeem, T. Fahringer
{"title":"Predicting the execution time of grid workflow applications through local learning","authors":"F. Nadeem, T. Fahringer","doi":"10.1145/1654059.1654093","DOIUrl":"https://doi.org/10.1145/1654059.1654093","url":null,"abstract":"Workflow execution time prediction is widely seen as a key service to understand the performance behavior and support the optimization of Grid workflow applications. In this paper, we present a novel approach for estimating the execution time of workflows based on Local Learning. The workflows are characterized in terms of different attributes describing structural and runtime information about workflow activities, control and data flow dependencies, number of Grid sites, problem size, etc. Our local learning framework is complemented by a dynamic weighing scheme that assigns weights to workflow attributes reflecting their impact on the workflow execution time. Predictions are given through intervals bounded by the minimum and maximum predicted values, which are associated with a confidence value indicating the degree of confidence about the prediction accuracy. Evaluation results for three real world workflows on a real Grid are presented to demonstrate the prediction accuracy and overheads of the proposed method.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126861735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 37
Space-efficient time-series call-path profiling of parallel applications 并行应用程序的空间效率时间序列调用路径分析
Z. Szebenyi, F. Wolf, B. Wylie
{"title":"Space-efficient time-series call-path profiling of parallel applications","authors":"Z. Szebenyi, F. Wolf, B. Wylie","doi":"10.1145/1654059.1654097","DOIUrl":"https://doi.org/10.1145/1654059.1654097","url":null,"abstract":"The performance behavior of parallel simulations often changes considerably as the simulation progresses - with potentially process-dependent variations of temporal patterns. While call-path profiling is an established method of linking a performance problem to the context in which it occurs, call paths reveal only little information about the temporal evolution of performance phenomena. However, generating call-path profiles separately for thousands of iterations may exceed available buffer space - especially when the call tree is large and more than one metric is collected. In this paper, we present a runtime approach for the semantic compression of call-path profiles based on incremental clustering of a series of single-iteration profiles that scales in terms of the number of iterations without sacrificing important performance details. Our approach offers low runtime overhead by using only a condensed version of the profile data when calculating distances and accounts for process-dependent variations by making all clustering decisions locally.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115087937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
A design methodology for domain-optimized power-efficient supercomputing 一种领域优化的节能超级计算设计方法
M. Mohiyuddin, M. Murphy, L. Oliker, J. Shalf, J. Wawrzynek, Samuel Williams
{"title":"A design methodology for domain-optimized power-efficient supercomputing","authors":"M. Mohiyuddin, M. Murphy, L. Oliker, J. Shalf, J. Wawrzynek, Samuel Williams","doi":"10.1145/1654059.1654072","DOIUrl":"https://doi.org/10.1145/1654059.1654072","url":null,"abstract":"As power has become the pre-eminent design constraint for future HPC systems, computational efficiency is being emphasized over simply peak performance. Recently, static benchmark codes have been used to find a power efficient architecture. Unfortunately, because compilers generate sub-optimal code, benchmark performance can be a poor indicator of the performance potential of architecture design points. Therefore, we present hardware/software cotuning as a novel approach for system design, in which traditional architecture space exploration is tightly coupled with software auto-tuning for delivering substantial improvements in area and power efficiency. We demonstrate the proposed methodology by exploring the parameter space of a Tensilica-based multi-processor running three of the most heavily used kernels in scientific computing, each with widely varying micro-architectural requirements: sparse matrix vector multiplication, stencil-based computations, and general matrix-matrix multiplication. Results demonstrate that co-tuning significantly improves hardware area and energy efficiency - a key driver for next generation of HPC system design.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"50 15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123003193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
Autotuning multigrid with PetaBricks 使用PetaBricks自动调整多网格
Cy P. Chan, Jason Ansel, Y. Wong, Saman P. Amarasinghe, A. Edelman
{"title":"Autotuning multigrid with PetaBricks","authors":"Cy P. Chan, Jason Ansel, Y. Wong, Saman P. Amarasinghe, A. Edelman","doi":"10.1145/1654059.1654065","DOIUrl":"https://doi.org/10.1145/1654059.1654065","url":null,"abstract":"Algorithmic choice is essential in any problem domain to realizing optimal computational performance. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels to take advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start all over when migrating from one machine to another. We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom of independently tuning both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms that have different accuracies and performances. To search this space efficiently, our autotuner utilizes a novel dynamic programming method to build efficient tuned algorithms from the bottom up. The results are customized multigrid algorithms that invest targeted computational power to yield the accuracy required by the user. The techniques we describe allow the user to automatically generate tuned multigrid cycles of different shapes targeted to the user's specific combination of problem, hardware, and accuracy requirements. These cycle shapes dictate the order in which grid coarsening and grid refinement are interleaved with both iterative methods, such as Jacobi or Successive Over-Relaxation, as well as direct methods, which tend to have superior performance for small problem sizes. The need to make choices between all of these methods brings the issue of variable accuracy to the forefront. Not only must the autotuning framework compare different possible multigrid cycle shapes against each other, but it also needs the ability to compare tuned cycles against both direct and (non-multigrid) iterative methods. We address this problem by using an accuracy metric for measuring the effectiveness of tuned cycle shapes and making comparisons over all algorithmic types based on this common yardstick. In our results, we find that the flexibility to trade performance versus accuracy at all levels of recursive computation enables us to achieve excellent performance on a variety of platforms compared to algorithmically static implementations of multigrid. Our implementation uses PetaBricks, an implicitly parallel programming language where algorithmic choices are exposed in the language. The PetaBricks compiler uses these choices to analyze, autotune, and verify the PetaBricks program. These language features, most notably the autotuner, were key in enabling our implementation to be clear, correct, and fast.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"17 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114047393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 35
Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors 多核处理器回旋动力学粒子到网格插值的内存效率优化
Kamesh Madduri, Samuel Williams, S. Ethier, L. Oliker, J. Shalf, E. Strohmaier, K. Yelick
{"title":"Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors","authors":"Kamesh Madduri, Samuel Williams, S. Ethier, L. Oliker, J. Shalf, E. Strohmaier, K. Yelick","doi":"10.1145/1654059.1654108","DOIUrl":"https://doi.org/10.1145/1654059.1654108","url":null,"abstract":"We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application to study turbulent transport in magnetic-confinement fusion devices. Particle-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charges to a 3D toroidal mesh, and multiple particles may contribute to the charge at a grid point. We design new parallel algorithms for the GTC charge deposition kernel, and analyze their performance on three leading multicore platforms. We implement thirteen different variants for this kernel and identify the best-performing ones given typical PIC parameters such as the grid size, number of particles per cell, and the GTC-specific particle Larmor radius variation. We find that our best strategies can be 2x faster than the reference optimized MPI implementation, and our analysis provides insight into desirable architectural features for high-performance PIC simulation codes.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132046496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 29
SCAMPI: a scalable CAM-based algorithm for multiple pattern inspection SCAMPI:一种可扩展的基于cam的多模式检测算法
F. Petrini, Virat Agarwal, D. Pasetto
{"title":"SCAMPI: a scalable CAM-based algorithm for multiple pattern inspection","authors":"F. Petrini, Virat Agarwal, D. Pasetto","doi":"10.1145/1654059.1654106","DOIUrl":"https://doi.org/10.1145/1654059.1654106","url":null,"abstract":"String matching is one of the most compute intensive steps in a network intrusion detection system. The growing network rates, rapidly approaching 10 Gbits/sec, and the large number of signatures that need to be scanned concurrently pose very demanding challenges to algorithmic design and practical implementation. In this paper we present SCAMPI, a ground-breaking string searching algorithm that is fast, space-efficient, scalable and resilient to attacks. SCAMPI is designed with a memory-centric model of complexity in mind, to minimize memory traffic and enhance data reuse with a careful compile-time data layout. The experimental evaluation executed on two families of multicore processors, Cell B.E and Intel Xeon E5472, shows that it is possible to obtain a processing rate of more than 2 Gbits/sec per core with very large dictionaries and heavy hitting rates. In the largest tested configuration, SCAMPI reaches 16 Gbits/sec on 8 Xeon cores, reaching, and in some cases exceeding, the performance of special-purpose processors and FPGA. Using SCAMPI we have been able to scan an input stream using a dictionary of 3.5 millions keywords, more than an order of magnitude larger than any published result in the literature and in commercial prototypes, at a rate of more than 1.2 Gbits/sec per processing core.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134371306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
A configurable algorithm for parallel image-compositing applications 并行图像合成应用的可配置算法
T. Peterka, David Goodell, R. Ross, Han-Wei Shen, R. Thakur
{"title":"A configurable algorithm for parallel image-compositing applications","authors":"T. Peterka, David Goodell, R. Ross, Han-Wei Shen, R. Thakur","doi":"10.1145/1654059.1654064","DOIUrl":"https://doi.org/10.1145/1654059.1654064","url":null,"abstract":"Collective communication operations can dominate the cost of large-scale parallel algorithms. Image compositing in parallel scientific visualization is a reduction operation where this is the case. We present a new algorithm called Radix-k that in many cases performs better than existing compositing algorithms. It does so through a set of configurable parameters, the radices, that determine the number of communication partners in each message round. The algorithm embodies and unifies binary swap and direct-send, two of the best-known compositing methods, and enables numerous other configurations through appropriate choices of radices. While the algorithm is not tied to a particular computing architecture or network topology, the selection of radices allows Radix-k to take advantage of new supercomputer interconnect features such as multiporting. We show scalability across image size and system size, including both powers of two and nonpowers-of-two process counts.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114856144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 79
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信