{"title":"A memory efficient algorithm for adaptive multidimensional integration with multiple GPUs","authors":"K. Arumugam, A. Godunov, D. Ranjan, B. Terzić, M. Zubair","doi":"10.1109/HiPC.2013.6799120","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799120","url":null,"abstract":"We present a memory-efficient algorithm and its implementation for solving multidimensional numerical integration on a cluster of compute nodes with multiple GPU devices per node. The effective use of shared memory is important for improving performance on GPUs because of the bandwidth limitation of the global memory. The best known sequential algorithm for multidimensional numerical integration, CUHRE, uses a large dynamic heap data structure which is accessed frequently. Devising a GPU algorithm that caches a part of this data structure in the shared memory so as to minimize global memory access is a challenging task. The algorithm presented here addresses this problem. Furthermore, we propose a technique to scale this algorithm to multiple GPU devices. The algorithm was implemented on a cluster of Intel® Xeon® CPU X5650 compute nodes with 4 Tesla M2090 GPU devices per node. We observed a speedup of up to 240 on a single GPU device as compared to a speedup of 70 when memory optimization was not used. On a cluster of 6 nodes (24 GPU devices) we were able to obtain a speedup of up to 3250. All speedups here are with reference to the sequential implementation running on the compute node.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132657616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
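The heap-driven subdivision loop at the heart of CUHRE-style adaptive integration can be sketched in a few lines. This is a hypothetical 1D illustration using a Simpson-vs-midpoint error proxy, not the actual CUHRE cubature rules or the paper's GPU shared-memory caching scheme:

```python
import heapq

def adaptive_integrate(f, a, b, tol=1e-8, max_iter=10_000):
    """Heap-driven adaptive quadrature: repeatedly split the region with
    the largest error estimate, the access pattern that makes CUHRE-style
    integrators heap-bound. Error proxy: |Simpson - midpoint| per region."""
    def rule(lo, hi):
        mid = 0.5 * (lo + hi)
        h = hi - lo
        simpson = h / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))
        midpoint = h * f(mid)
        return simpson, abs(simpson - midpoint)

    est, err = rule(a, b)
    heap = [(-err, a, b, est)]          # max-heap on error via negation
    total, total_err = est, err
    for _ in range(max_iter):
        if total_err < tol:
            break
        neg_err, lo, hi, est = heapq.heappop(heap)
        total -= est
        total_err += neg_err            # neg_err is negative: removes err
        mid = 0.5 * (lo + hi)
        for l, r in ((lo, mid), (mid, hi)):
            e, d = rule(l, r)
            total += e
            total_err += d
            heapq.heappush(heap, (-d, l, r, e))
    return total
```

The GPU challenge the abstract describes is exactly that this heap is large, dynamic, and touched on every iteration, which is hostile to shared-memory caching.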
{"title":"Adding data parallelism to streaming pipelines for throughput optimization","authors":"Peng Li, Kunal Agrawal, J. Buhler, R. Chamberlain","doi":"10.1109/HiPC.2013.6799119","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799119","url":null,"abstract":"The streaming model is a popular model for writing high-throughput parallel applications. A streaming application is represented by a graph of computation stages that communicate with each other via FIFO channels. In this paper, we consider the problem of mapping streaming pipelines - streaming applications where the graph is a linear chain - onto a set of computing resources in order to maximize its throughput. In a parallel setting, subsets of stages, called components, can be mapped onto different computing resources. The throughput of an application is determined by the throughput of the slowest component. Therefore, if some stage is much slower than others, then it may be useful to replicate the stage's code and divide its workload among two or more replicas in order to increase throughput. However, pipelines may consist of some replicable and some non-replicable stages. In this paper, we address the problem of mapping these partially replicable streaming pipelines onto both homogeneous and heterogeneous platforms so as to maximize throughput. We consider two types of platforms, homogeneous platforms - where all resources are identical, and heterogeneous platforms - where resources may have different speeds. In both cases, we consider two network topologies - unidirectional chain and clique. We provide polynomial-time algorithms for mapping partially replicable pipelines onto unidirectional chains for both homogeneous and heterogeneous platforms. For homogeneous platforms, the algorithm for unidirectional chains generalizes to clique topologies. However, for heterogeneous platforms, mapping these pipelines onto clique topologies is NP-complete. We provide heuristics to generate solutions for cliques by applying our chain algorithms to a series of chains sampled from the clique. Our empirical results show that these heuristics rapidly converge to near-optimal solutions.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122247356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
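The objective function behind this mapping problem - throughput is set by the slowest component, and replicating a stage divides its work across replicas - can be captured in a toy cost model. This is an illustrative sketch under a uniform-resource-speed assumption, not the paper's chain algorithm; the function and parameter names are ours:

```python
def pipeline_throughput(stage_costs, components):
    """Throughput of a mapped streaming pipeline.

    stage_costs: per-item processing cost of each stage.
    components:  list of (stage_indices, replica_count) pairs; each
                 component occupies replica_count identical resources.
    The pipeline runs at the rate of its slowest component; replication
    divides a component's per-item work across its replicas."""
    slowest = max(
        sum(stage_costs[i] for i in stages) / replicas
        for stages, replicas in components
    )
    return 1.0 / slowest
```

For stage costs [2, 6, 2], mapping each stage to its own resource gives throughput 1/6; replicating the middle stage over two resources raises it to 1/3, which is the kind of gain the paper's algorithms search for.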
{"title":"Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs","authors":"Pai-Wei Lai, Humayun Arafat, V. Elango, P. Sadayappan","doi":"10.1109/HiPC.2013.6799109","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799109","url":null,"abstract":"In this paper, we report on the development of an efficient GPU implementation of the Strassen-Winograd matrix multiplication algorithm for matrices of arbitrary sizes. We utilize multi-kernel streaming to exploit concurrency across sub-matrix operations in addition to intra-operation parallelism. We evaluate the performance of the implementation in comparison with CUBLAS-5.0 on Fermi and Kepler GPUs. The experimental results demonstrate the usefulness of Strassen's algorithm for practically relevant matrix sizes on GPUs, with up to 1.27X speedup for single-precision and 1.42X speedup for double-precision floating point computation.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130729545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
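Strassen's recursion - seven half-size multiplications in place of eight - is what creates the independent sub-matrix operations the paper streams across kernels. A minimal pure-Python sketch of the classic variant (the paper uses the Winograd form, which trades the 18 additions/subtractions below for 15; this sketch ignores the GPU and arbitrary-size aspects entirely):

```python
def strassen(A, B):
    """Strassen's 7-multiplication recursion on square list-of-lists
    matrices whose size is a power of two. Falls back to the naive
    product at small sizes."""
    n = len(A)
    if n <= 2:  # base case: naive product
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    def quad(M, r, c):  # h x h sub-block starting at (r, c)
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    # The seven products; these are mutually independent, which is what
    # multi-kernel streaming exploits on the GPU.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```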
{"title":"Benchmarking MIC architectures with Monte Carlo simulations of spin glass systems","authors":"A. Gabbana, M. Pivanti, S. Schifano, R. Tripiccione","doi":"10.1109/HIPC.2013.6799111","DOIUrl":"https://doi.org/10.1109/HIPC.2013.6799111","url":null,"abstract":"Spin glasses - theoretical models used to capture several physical properties of real glasses - are mostly studied by Monte Carlo simulations. The associated algorithms have a very large and easily identifiable degree of available parallelism, which can also easily be cast in SIMD form. State-of-the-art multi- and many-core processors and accelerators are therefore a promising computational platform to support these Grand Challenge applications. In this paper we port and optimize for many-core processors a Monte Carlo code for the simulation of the 3D Edwards Anderson spin glass, focusing on a dual eight-core Sandy Bridge processor, and on a Xeon-Phi co-processor based on the new Many Integrated Core architecture. We present performance results, discuss bottlenecks preventing further performance gains and compare with the corresponding figures for GPU-based implementations and for application-specific dedicated machines.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"180 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128173769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
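The parallelism the abstract points to comes from the Metropolis update rule: spins that are not neighbors can be flipped independently, which maps directly onto SIMD lanes. A toy 1D Edwards-Anderson sweep, sequential and one-dimensional for brevity (the paper's code is a 3D, vectorized version; the function name and fixed seed are ours):

```python
import math
import random

def metropolis_sweep(spins, J, beta, rng):
    """One Metropolis sweep over a 1D Edwards-Anderson ring.
    spins[i] in {+1, -1}; J[i] couples site i to site (i + 1) % n.
    Flipping s_i changes the energy E = -sum_i J[i] s_i s_{i+1} by
    dE = 2 s_i (J[i-1] s_{i-1} + J[i] s_{i+1})."""
    n = len(spins)
    for i in range(n):
        dE = 2.0 * spins[i] * (J[(i - 1) % n] * spins[(i - 1) % n]
                               + J[i] * spins[(i + 1) % n])
        # accept the flip with probability min(1, exp(-beta * dE))
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            spins[i] = -spins[i]
    return spins
```

In the 3D model each site has six neighbors rather than two, and a checkerboard partition lets half the lattice update in parallel per half-sweep.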
{"title":"Efficient homology computations on multicore and manycore systems","authors":"N. Anurag Murty, V. Natarajan, Sathish S. Vadhiyar","doi":"10.1109/HiPC.2013.6799139","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799139","url":null,"abstract":"Homology computations form an important step in topological data analysis that helps to identify connected components, holes, and voids in multi-dimensional data. Our work focuses on algorithms for homology computations of large simplicial complexes on multicore machines and on GPUs. This paper presents two parallel algorithms to compute homology. A core component of both algorithms is the algebraic reduction of a cell with respect to one of its faces while preserving the homology of the original simplicial complex. The first algorithm is a parallel version of an existing sequential implementation using OpenMP. The algorithm processes and reduces cells within each partition of the complex in parallel while minimizing sequential reductions on the partition boundaries. Cache misses are reduced by ensuring data locality for data in the same partition. We observe a linear speedup on algebraic reductions and an overall speedup of up to 4.9× with 16 cores over sequential reductions. The second algorithm is based on a novel approach for homology computations on manycore/GPU architectures. This GPU algorithm is memory efficient and capable of extremely fast computation of homology for simplicial complexes with millions of simplices. We observe up to 40× speedup in runtime over sequential reductions and up to 4.5× speedup over the REDHOM library, which includes the sequential algebraic reductions together with other advanced homology engines supported in the software.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131727245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
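What the reductions preserve is the homology of the complex, i.e. the rank data of its boundary maps. A small GF(2) rank/Betti-number sketch of the classical boundary-matrix computation (this is the invariant, not the paper's cell-reduction algorithm or its parallel scheme; the helper names are ours):

```python
def gf2_rank(columns):
    """Rank over GF(2) of a matrix given as a list of columns,
    each column the set of row indices holding a 1."""
    pivots = {}                    # pivot row -> its reduced column
    rank = 0
    for col in columns:
        col = set(col)
        while col:
            p = max(col)
            if p not in pivots:
                pivots[p] = col
                rank += 1
                break
            col = col ^ pivots[p]  # symmetric difference = GF(2) addition
    return rank

def betti(n_cells, boundary):
    """Betti numbers via rank-nullity: b_k = n_k - rank d_k - rank d_{k+1}.
    n_cells[k] counts k-cells; boundary[k] lists the columns of d_k
    (one set of (k-1)-cell indices per k-cell); d_0 is the zero map."""
    dim = len(n_cells) - 1
    rank = {0: 0, dim + 1: 0}
    for k in range(1, dim + 1):
        rank[k] = gf2_rank(boundary.get(k, []))
    return [n_cells[k] - rank[k] - rank[k + 1] for k in range(dim + 1)]
```

A hollow triangle (3 vertices, 3 edges) yields Betti numbers [1, 1] - one component, one loop - while filling it with a 2-cell kills the loop.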
{"title":"Analyzing the performance impact of authorization constraints and optimizing the authorization methods for workflows","authors":"Nadeem Chaudhary, Ligang He","doi":"10.1109/HiPC.2013.6799115","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799115","url":null,"abstract":"Many workflow management systems have been developed to enhance the performance of workflow executions. The authorization policies deployed in the system may restrict the task executions. The common authorization constraints include role constraints, Separation of Duty (SoD), Binding of Duty (BoD) and temporal constraints. This paper presents methods to check the feasibility of these constraints, and also determines the time durations during which the temporal constraints impose no negative impact on performance. Further, this paper presents an optimal authorization method, which is optimal in the sense that it can minimize a workflow's delay caused by the temporal constraints. Simulation experiments have been conducted to verify the effectiveness of the proposed authorization method. The experimental results show that, compared with the intuitive authorization method, the optimal authorization method can reduce the delay caused by the authorization constraints and consequently reduce the workflows' response time.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121080019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring energy and performance behaviors of data-intensive scientific workflows on systems with deep memory hierarchies","authors":"Marc Gamell, I. Rodero, M. Parashar, S. Poole","doi":"10.1109/HiPC.2013.6799122","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799122","url":null,"abstract":"The increasing gap between the rate at which large scale scientific simulations generate data and the corresponding storage speeds and capacities is leading to more complex system architectures with deep memory hierarchies. Advances in non-volatile memory (NVRAM) technology have made it an attractive candidate as intermediate storage in this memory hierarchy to address the latency and performance gap between main memory and disk storage. As a result, it is important to understand and model its energy/performance behavior from an application perspective as well as how it can be effectively used for staging data within an application workflow. In this paper, we target a NVRAM-based deep memory hierarchy and explore its potential for supporting in-situ/in-transit data analytics pipelines that are part of application workflow patterns. Specifically, we model the memory hierarchy and experimentally explore energy/performance behaviors of different data management strategies and data exchange patterns, as well as the tradeoffs associated with data placement, data movement and data processing.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125235097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hybrid shared memory heterogeneous execution platform for PCIe-based GPGPUs","authors":"S. Shukla, L. Bhuyan","doi":"10.1109/HiPC.2013.6799140","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799140","url":null,"abstract":"The disparity between the CPU and GPU domains has forced programmers to adhere to the traditional driver-based GPU programming approach. The negative implications of this approach are inter-domain data transfer overhead, host memory pressure and CPU underutilization. In this paper, we propose a novel hybrid shared memory-based execution approach to enhance the throughput of General Purpose GPU (GPGPU) applications. To achieve optimal GPU execution, we adopted a midway approach between the shared memory and traditional disjoint memory GPU programming approach. Our design involves OS enhancements and extensions to an OS-integrated open-source GPU driver (GDev) which together provide the GPU application a shared memory execution platform. Our design not only eliminates several drawbacks associated with the traditional GPU programming approach, but also allows data-parallel execution across CPUs and GPU.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127193945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance and energy consumption analysis of a seismic application for three different architectures intended for oil and gas industry","authors":"Lucas T. Melo, G. Menezes, A. Silva-Filho, M. Lima","doi":"10.1109/HiPC.2013.6799114","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799114","url":null,"abstract":"In most cases, seismic migration applications demand considerable computing throughput, since they continuously run complex models to evaluate the drilling of petroleum wells. Given the inherent computational complexity and the immense amount of processed data, High-Performance Computing (HPC) solutions are attractive for this kind of application. Low energy consumption is also highly desirable in high-performance processing, particularly in large clusters that continuously run certain applications; thus, an architecture that combines high performance and low energy consumption is desirable. This work presents an analysis of performance, energy consumption and cost for three different architectures (Multicore, FPGA and GPGPU) intended to run a seismic application based on the RTM (Reverse Time Migration) algorithm in an industrial setting. Results indicate that the GPGPU architecture achieved the best performance in terms of energy consumption: compared to the Multicore architecture, about 15 times higher efficiency per Joule was observed. This architecture also performed the RTM algorithm about 32 times faster than the non-optimized implementation on CPU.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128863133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
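RTM's computational core is an explicit wave-equation stencil swept forward in time for the source wavefield and in reverse time for the receiver data; this stencil is what all three platforms accelerate. A toy 1D time step as a hypothetical simplification (production RTM uses 3D, higher-order stencils and absorbing boundaries):

```python
def wave_step(u_prev, u_curr, r2):
    """One explicit step of the 1D scalar wave equation u_tt = c^2 u_xx:
    u_next[i] = 2 u_curr[i] - u_prev[i] + r2 * laplacian(u_curr)[i],
    where r2 = (c * dt / dx)^2 (stable for r2 <= 1). Fixed zero
    boundaries. Run forward for the source field, in reverse time for
    the receiver field; RTM images where the two correlate."""
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(1, n - 1):
        lap = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + r2 * lap
    return u_next
```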
{"title":"Efficient sparse matrix multiple-vector multiplication using a bitmapped format","authors":"Ramaseshan Kannan","doi":"10.1109/HiPC.2013.6799135","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799135","url":null,"abstract":"The problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines is considered. Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without a fill overhead, thereby offering blocking without additional storage and bandwidth overheads. An efficient algorithm decodes bitmaps using de Bruijn sequences and minimizes the number of conditionals evaluated. Performance is compared with that of popular formats, including vendor implementations of sparse BLAS. Our sparse matrix multiple-vector multiplication algorithm achieves high throughput on all platforms and is implemented using platform neutral optimizations.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129035455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
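The bitmap decoding the abstract refers to can be done branch-free with the well-known de Bruijn multiplication trick: isolate the lowest set bit, multiply by a de Bruijn constant, and use the top bits to index a precomputed table. A Python sketch of just this decoding idiom (the mapped blocked row format's actual block layout and loop structure are the paper's own and are not reproduced here):

```python
# 32-bit de Bruijn sequence B(2,5); for an isolated bit x = 1 << i,
# the top 5 bits of (x * DEBRUIJN) mod 2^32 uniquely identify i.
DEBRUIJN = 0x077CB531
INDEX = [0] * 32
for i in range(32):
    INDEX[((DEBRUIJN << i) & 0xFFFFFFFF) >> 27] = i

def iter_set_bits(bitmap):
    """Yield positions of set bits in a 32-bit block bitmap using the
    de Bruijn lookup - a branch-free way to locate stored entries
    without testing every bit position."""
    while bitmap:
        low = bitmap & -bitmap               # isolate lowest set bit
        yield INDEX[((low * DEBRUIJN) & 0xFFFFFFFF) >> 27]
        bitmap &= bitmap - 1                 # clear that bit
```

The loop visits only the nonzero entries of a block, so decoding cost scales with the number of stored values rather than the block size, which is what keeps the bitmapped format's bandwidth overhead low.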