20th Annual International Conference on High Performance Computing最新文献_第2页

A memory efficient algorithm for adaptive multidimensional integration with multiple GPUs 多gpu自适应多维集成的高效内存算法

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799120

K. Arumugam, A. Godunov, D. Ranjan, B. Terzić, M. Zubair

{"title":"A memory efficient algorithm for adaptive multidimensional integration with multiple GPUs","authors":"K. Arumugam, A. Godunov, D. Ranjan, B. Terzić, M. Zubair","doi":"10.1109/HiPC.2013.6799120","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799120","url":null,"abstract":"We present a memory-efficient algorithm and its implementation for solving multidimensional numerical integration on a cluster of compute nodes with multiple GPU devices per node. The effective use of shared memory is important for improving the performance on GPUs, because of the bandwidth limitation of the global memory. The best known sequential algorithm for multidimensional numerical integration CUHRE uses a large dynamic heap data structure which is accessed frequently. Devising a GPU algorithm that caches a part of this data structure in the shared memory so as to minimizes global memory access is a challenging task. The algorithm presented here addresses this problem. Furthermore we propose a technique to scale this algorithm to multiple GPU devices. The algorithm was implemented on a cluster of Intel® Xeon® CPU X5650 compute nodes with 4 Tesla M2090 GPU devices per node. We observed a speedup of up to 240 on a single GPU device as compared to a speedup of 70 when memory optimization was not used. On a cluster of 6 nodes (24 GPU devices) we were able to obtain a speedup of up to 3250. All speedups here are with reference to the sequential implementation running on the compute node.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132657616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

Adding data parallelism to streaming pipelines for throughput optimization 将数据并行性添加到流管道中以实现吞吐量优化

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799119

Peng Li, Kunal Agrawal, J. Buhler, R. Chamberlain

{"title":"Adding data parallelism to streaming pipelines for throughput optimization","authors":"Peng Li, Kunal Agrawal, J. Buhler, R. Chamberlain","doi":"10.1109/HiPC.2013.6799119","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799119","url":null,"abstract":"The streaming model is a popular model for writing high-throughput parallel applications. A streaming application is represented by a graph of computation stages that communicate with each other via FIFO channels. In this paper, we consider the problem of mapping streaming pipelines - streaming applications where the graph is a linear chain - onto a set of computing resources in order to maximize its throughput. In a parallel setting, subsets of stages, called components, can be mapped onto different computing resources. The throughput of an application is determined by the throughput of the slowest component. Therefore, if some stage is much slower than others, then it may be useful to replicate the stage's code and divide its workload among two or more replicas in order to increase throughput. However, pipelines may consist of some replicable and some non-replicable stages. In this paper, we address the problem of mapping these partially replicable streaming pipelines onto both homogeneous and heterogeneous platforms so as to maximize throughput. We consider two types of platforms, homogeneous platforms - where all resources are identical, and heterogeneous platforms - where resources may have different speeds. In both cases, we consider two network topologies-unidirectional chain and clique. We provide polynomial-time algorithms for mapping partially replicable pipelines onto unidirectional chains for both homogeneous and heterogeneous platforms. For homogeneous platforms, the algorithm for unidirectional chains generalizes to clique topologies. However, for heterogeneous platforms, mapping these pipelines onto clique topologies is NP-complete. We provide heuristics to generate solutions for cliques by applying our chain algorithms to a series of chains sampled from the clique. Our empirical results show that these heuristics rapidly converge to near-optimal solutions.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122247356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs 在gpu上加速Strassen-Winograd矩阵乘法算法

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799109

Pai-Wei Lai, Humayun Arafat, V. Elango, P. Sadayappan

引用次数: 16

Benchmarking MIC architectures with Monte Carlo simulations of spin glass systems 用自旋玻璃系统的蒙特卡罗模拟对标MIC架构

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HIPC.2013.6799111

A. Gabbana, M. Pivanti, S. Schifano, R. Tripiccione

引用次数: 3

Efficient homology computations on multicore and manycore systems 多核和多核系统的高效同源计算

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799139

N. Anurag Murty, V. Natarajan, Sathish S. Vadhiyar

{"title":"Efficient homology computations on multicore and manycore systems","authors":"N. Anurag Murty, V. Natarajan, Sathish S. Vadhiyar","doi":"10.1109/HiPC.2013.6799139","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799139","url":null,"abstract":"Homology computations form an important step in topological data analysis that helps to identify connected components, holes, and voids in multi-dimensional data. Our work focuses on algorithms for homology computations of large simplicial complexes on multicore machines and on GPUs. This paper presents two parallel algorithms to compute homology. A core component of both algorithms is the algebraic reduction of a cell with respect to one of its faces while preserving the homology of the original simplicial complex. The first algorithm is a parallel version of an existing sequential implementation using OpenMP. The algorithm processes and reduces cells within each partition of the complex in parallel while minimizing sequential reductions on the partition boundaries. Cache misses are reduced by ensuring data locality for data in the same partition. We observe a linear speedup on algebraic reductions and an overall speedup of up to 4.9× with 16 cores over sequential reductions. The second algorithm is based on a novel approach for homology computations on manycore/GPU architectures. This GPU algorithm is memory efficient and capable of extremely fast computation of homology for simplicial complexes with millions of simplices. We observe up to 40× speedup in runtime over sequential reductions and up to 4.5× speedup over REDHOM library, which includes the sequential algebraic reductions together with other advanced homology engines supported in the software.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131727245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Analyzing the performance impact of authorization constraints and optimizing the authorization methods for workflows 分析授权约束对性能的影响，优化工作流的授权方法

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799115

Nadeem Chaudhary, Ligang He

引用次数: 2

Exploring energy and performance behaviors of data-intensive scientific workflows on systems with deep memory hierarchies 探索具有深度内存层次的系统上数据密集型科学工作流的能量和性能行为

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799122

Marc Gamell, I. Rodero, M. Parashar, S. Poole

{"title":"Exploring energy and performance behaviors of data-intensive scientific workflows on systems with deep memory hierarchies","authors":"Marc Gamell, I. Rodero, M. Parashar, S. Poole","doi":"10.1109/HiPC.2013.6799122","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799122","url":null,"abstract":"The increasing gap between the rate at which large scale scientific simulations generate data and the corresponding storage speeds and capacities is leading to more complex system architectures with deep memory hierarchies. Advances in non-volatile memory (NVRAM) technology have made it an attractive candidate as intermediate storage in this memory hierarchy to address the latency and performance gap between main memory and disk storage. As a result, it is important to understand and model its energy/performance behavior from an application perspective as well as how it can be effectively used for staging data within an application workflow. In this paper, we target a NVRAM-based deep memory hierarchy and explore its potential for supporting in-situ/in-transit data analytics pipelines that are part of application workflows patterns. Specifically, we model the memory hierarchy and experimentally explore energy/performance behaviors of different data management strategies and data exchange patterns, as well as the tradeoffs associated with data placement, data movement and data processing.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125235097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

A hybrid shared memory heterogeneous execution platform for PCIe-based GPGPUs 基于pcie的gpgpu的混合共享内存异构执行平台

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799140

S. Shukla, L. Bhuyan

引用次数: 2

Performance and energy consumption analysis of a seismic application for three different architectures intended for oil and gas industry 针对石油和天然气行业的三种不同架构的地震应用程序的性能和能耗分析

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799114

Lucas T. Melo, G. Menezes, A. Silva-Filho, M. Lima

{"title":"Performance and energy consumption analysis of a seismic application for three different architectures intended for oil and gas industry","authors":"Lucas T. Melo, G. Menezes, A. Silva-Filho, M. Lima","doi":"10.1109/HiPC.2013.6799114","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799114","url":null,"abstract":"In the most cases, seismic migration applications demand considerable computing throughput, since this feature denotes a need to apply complex models that are continuously run to evaluate drilling of petroleum wells. Given to the inherent computational complexity and the immense amount of processed data, High-Performance Computing (HPC) solutions can be attractive to perform this kind of application. Nowadays, low energy consumption has been highly desirable in high-performance processing when applied to large clusters that continuously run certain applications. Thereby, the use of an architecture that combines high performance and low energy consumption is desirable. This work describes an analysis in terms of performance, energy consumption and cost for three different architectures (Multicore, FPGA and GPGPU) intended to process a seismic application based on RTM (Reverse Time Migration) algorithm for an industrial application with this objective. Results indicate that the GPGPU architecture achieved the best performance in terms of energy consumption. When compared to the Multicore architecture, an about 15 times higher efficiency/Joule was observed. This architecture performed the RTM algorithm about 32 times faster when compared with the non-optimized implementation on CPU.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128863133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Efficient sparse matrix multiple-vector multiplication using a bitmapped format 使用位图格式的高效稀疏矩阵多向量乘法

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI: 10.1109/HiPC.2013.6799135

Ramaseshan Kannan

引用次数: 19