2011 IEEE International Parallel & Distributed Processing Symposium: Latest Papers (Page 5)

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.90

E. Agullo, C. Augonnet, J. Dongarra, Mathieu Faverge, H. Ltaief, Samuel Thibault, S. Tomov

Abstract: One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method has three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity, each intended to be executed on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact of these GPU kernels on performance. In particular, we present the benefits of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, based respectively on static and dynamic scheduling. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We, however, highlight the lack of portability of this approach and its limitation to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.
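The first step described in the abstract, expressing the factorization as a sequence of tasks, can be illustrated with the tile QR formulation commonly used in the literature for multicore machines. The sketch below only enumerates tasks in a valid sequential order; it performs no numerical work. The kernel names (GEQRT, UNMQR, TSQRT, TSMQR) follow common tile-algorithm usage and are an assumption here, since the abstract does not name its kernels, and the function name `tile_qr_tasks` is hypothetical.

```python
from collections import Counter


def tile_qr_tasks(nt):
    """List the tasks of a tile QR factorization on an nt x nt grid of tiles.

    Kernel names are assumed from common tile-algorithm usage, not taken
    from the paper. Tasks are returned in one valid sequential order.
    """
    tasks = []
    for k in range(nt):
        tasks.append(("GEQRT", k, k))             # factor the diagonal tile (k, k)
        for n in range(k + 1, nt):
            tasks.append(("UNMQR", k, n))         # apply its reflectors to tile (k, n)
        for m in range(k + 1, nt):
            tasks.append(("TSQRT", m, k))         # eliminate tile (m, k) against (k, k)
            for n in range(k + 1, nt):
                tasks.append(("TSMQR", m, n, k))  # update tiles (k, n) and (m, n)
    return tasks


tasks = tile_qr_tasks(3)
kernel_counts = Counter(task[0] for task in tasks)
```

For a 3 x 3 tile grid this yields 14 tasks. A static or dynamic scheduler, as contrasted in the abstract, would then map each task to a CPU core or a GPU while respecting the data dependencies between tiles.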

Citations: 117

I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.106

Deepak Ajwani, Nodari Sitchinava, N. Zeh

Abstract: The parallel external memory (PEM) model has been used as a basis for the design and analysis of a wide range of algorithms for private-cache multi-core architectures. As a tool for developing geometric algorithms in this model, a parallel version of the I/O-efficient distribution sweeping framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of $O(\mathrm{sort}_P(N) + K/PB)$ for a number of problems on axis-aligned objects. Here, $P$ denotes the number of cores/processors, $B$ denotes the number of elements that fit in a cache line, $N$ and $K$ denote the sizes of the input and output, respectively, and $\mathrm{sort}_P(N)$ denotes the I/O complexity of sorting $N$ items using $P$ processors in the PEM model. To obtain the above improvement, we present a new one-dimensional batched range counting algorithm on a sorted list of ranges and points that achieves an I/O complexity of $O((N + K)/PB)$, where $K$ is the sum of the counts of all the ranges. The key to achieving efficient load balancing among the processors in this algorithm is a new method to count the output without enumerating it, which might be of independent interest.
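To make the one-dimensional batched range counting problem concrete, the following is a minimal sequential sketch. It is not the paper's PEM algorithm and makes no claim about I/O complexity; it only shows how the count for each range, and hence the total output size $K$, can be obtained without enumerating the points in each range. The function name `batched_range_counts` and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def batched_range_counts(points, ranges):
    """Return, for each closed range [lo, hi], how many points fall inside it.

    Sequential illustration only: two binary searches per range give the
    count directly, without listing the matching points.
    """
    pts = sorted(points)
    return [bisect_right(pts, hi) - bisect_left(pts, lo) for lo, hi in ranges]


counts = batched_range_counts([1, 3, 4, 7, 9], [(2, 7), (0, 1), (8, 8)])
total_output = sum(counts)  # K, the sum of the counts of all the ranges
```

Knowing the per-range counts in advance is what allows the output to be divided evenly among processors before any of it is written, which is the load-balancing idea the abstract highlights.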

Citations: 8

Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.88

Andrew Nere, Atif Hashmi, Mikko H. Lipasti

Abstract: Recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. Such attributes have once again sparked an interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain. In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Our cortical network, like the brain, exhibits massive amounts of processing parallelism, making today's GPGPUs a highly attractive and readily-available hardware accelerator for such a model. Furthermore, we consider two inefficiencies inherent to our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. We propose optimizations such as a software work-queue structure and pipelining the hierarchical layers of the cortical network to mitigate such problems. Our analysis provides important insight into the GPU architecture details including the number of cores, the memory system, and the global thread scheduler. Additionally, we create a runtime profiling tool for our parallel learning algorithm which proportionally distributes the cortical network across the host CPU as well as multiple GPUs, whether homogeneous or heterogeneous, that may be available to the system. Using the profiling tool with these optimizations on Nvidia's CUDA framework, we achieve up to 60x speedup over a single-threaded CPU implementation of the model.
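The proportional distribution mentioned in the abstract can be pictured as splitting a fixed number of work units across devices according to their measured throughput. The sketch below is one simple way to do that, using largest-remainder rounding. It is an assumption for illustration, not the authors' profiling tool: the function name `proportional_split`, the notion of "work units", and the sample throughput figures are all hypothetical.

```python
def proportional_split(units, throughputs):
    """Split `units` work items across devices in proportion to throughput.

    Hypothetical helper: `throughputs` holds one profiled rate per device
    (for example the host CPU followed by each GPU). Rounding uses the
    largest-remainder rule so the allocations always sum to `units`.
    """
    total = sum(throughputs)
    shares = [units * t / total for t in throughputs]
    alloc = [int(s) for s in shares]  # floor of each ideal share
    leftover = units - sum(alloc)
    # Hand the remaining units to the devices with the largest fractional parts.
    by_fraction = sorted(
        range(len(shares)), key=lambda i: shares[i] - alloc[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


# One CPU and two unequal GPUs, with made-up relative throughputs.
allocation = proportional_split(1000, [1.0, 8.0, 16.0])
```

With these illustrative rates the faster GPU receives twice as many units as the slower one, so all devices would finish at roughly the same time if the profiled rates hold.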

Citations: 33

Power, Programmability, and Granularity: The Challenges of ExaScale Computing

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.420

B. Dally

Citations: 81

Exploiting Data Similarity to Reduce Memory Footprints

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.24

Susmit Biswas, B. Supinski, M. Schulz, D. Franklin, T. Sherwood, F. Chong

Abstract: Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage more efficiently, preferably transparently, could increase effective DRAM capacity and thus the benefit of multicore nodes for HPC systems. MPI application processes often exhibit significant data similarity. These data regions occupy multiple physical locations across the individual rank processes within a multicore node and thus offer a potential savings in memory capacity. These regions, primarily residing in heap, are dynamic, which makes them difficult to manage statically. Our novel memory allocation library, SBLLmalloc, automatically identifies identical memory blocks and merges them into a single copy. Our implementation is transparent to the application and does not require any kernel modifications. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32.03% on average and up to 60.87%. Further, SBLLmalloc supports problem sizes for IRS over 21.36% larger than using standard memory management techniques, thus significantly increasing effective system size. Similarly, SBLLmalloc requires 43.75% fewer nodes than standard memory management techniques to solve an AMG problem.
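The idea of identifying identical memory blocks and keeping a single copy can be sketched on plain byte strings. This is a toy illustration under stated assumptions, not how the library works: it hashes fixed-size blocks with SHA-256 and treats equal digests as equal content, it ignores how a real allocator would share pages between processes, and the function name `merge_identical_blocks`, the 4096-byte block size, and the sample "rank" buffers are all hypothetical.

```python
import hashlib


def merge_identical_blocks(buffers, block_size=4096):
    """Deduplicate fixed-size blocks across several byte buffers.

    Returns (store, layout): `store` maps a block digest to the single
    retained copy, and `layout` lists, per buffer, the digests needed to
    rebuild it. Equal digests are assumed to mean equal content.
    """
    store = {}
    layout = []
    for buf in buffers:
        digests = []
        for offset in range(0, len(buf), block_size):
            block = bytes(buf[offset:offset + block_size])
            digest = hashlib.sha256(block).digest()
            store.setdefault(digest, block)  # keep only the first copy
            digests.append(digest)
        layout.append(digests)
    return store, layout


# Two simulated MPI ranks whose heaps share three of their four blocks.
a, b, c, d, e = (bytes([v]) * 4096 for v in range(5))
rank0 = a + b + c + d
rank1 = a + b + c + e
store, layout = merge_identical_blocks([rank0, rank1])
blocks_before = sum(len(digests) for digests in layout)
blocks_after = len(store)
```

In this made-up case eight blocks shrink to five unique ones. The savings in a real application depend entirely on how much data the ranks actually share, which is what the paper measures.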

Citations: 28

Computing Strongly Connected Components in Parallel on CUDA

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.59

J. Barnat, Petr Bauch, L. Brim, Milan Ceska

Citations: 99

A Study of Parallel Particle Tracing for Steady-State and Time-Varying Flow Fields

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.62

T. Peterka, R. Ross, B. Nouanesengsy, Teng-Yok Lee, Han-Wei Shen, W. Kendall, Jian Huang

Citations: 92

Improved Algorithms for the Distributed Trigger Counting Problem

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.56

Venkatesan T. Chakaravarthy, Anamitra R. Choudhury, Yogish Sabharwal

Abstract: Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting (DTC) problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. In this paper, we present two decentralized and randomized algorithms for the DTC problem. The first algorithm has message complexity O(n log w) and no processor receives more than O(log w) messages with high probability. It does not provide any bound on the messages sent per processor. This algorithm assumes complete connectivity between the processors. The second algorithm has message complexity O(n log n log w) and no processor exchanges more than O(log n log w) messages with high probability. However, there is a negligible failure probability in raising the alert on receiving w triggers. This algorithm only requires that a constant degree tree be embeddable in the underlying communication graph.
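To illustrate the DTC problem itself, the sketch below simulates a simple coordinator-based, round-based baseline. It is explicitly not either of the paper's decentralized randomized algorithms: it assumes a single coordinator that every processor can reach, and the function name `simulate_round_based_dtc`, the threshold rule, and the uniform random arrival of triggers are assumptions made for illustration. In each round a processor reports once per `tau` local triggers, and the coordinator polls all processors after receiving enough reports, so the alert fires exactly when w triggers have arrived.

```python
import random


def simulate_round_based_dtc(n, w, seed=0):
    """Simulate a centralized DTC baseline; return (triggers_seen, messages).

    Assumed scheme, not the paper's: per-round local threshold
    tau = max(1, remaining // (2 * n)), so a round can never overshoot
    the number of triggers still awaited.
    """
    rng = random.Random(seed)
    remaining = w        # triggers still awaited before the alert
    triggers_seen = 0
    messages = 0
    while remaining > 0:
        tau = max(1, remaining // (2 * n))
        target = min(n, remaining)   # reports that end this round
        residual = [0] * n           # unreported triggers per processor
        reports = 0
        while reports < target:
            p = rng.randrange(n)     # next trigger hits a random processor
            triggers_seen += 1
            residual[p] += 1
            if residual[p] == tau:
                residual[p] = 0
                reports += 1
                messages += 1        # report sent to the coordinator
        messages += 2 * n            # coordinator polls, processors reply
        remaining -= reports * tau + sum(residual)
    return triggers_seen, messages


seen, msgs = simulate_round_based_dtc(n=8, w=100_000)
```

The loop stops with `seen` equal to w, and the message count stays far below one message per trigger because the threshold shrinks geometrically from round to round. The coordinator, however, receives every message, which is the kind of per-processor load the paper's decentralized algorithms are designed to bound.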

Citations: 9

CATCH: A Cloud-Based Adaptive Data Transfer Service for HPC

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.118

H. M. Monti, A. Butt, Sudharshan S. Vazhkudai

Abstract: Modern High Performance Computing (HPC) applications process very large amounts of data. A critical research challenge lies in transporting input data to the HPC center from a number of distributed sources, e.g., scientific experiments and web repositories, and offloading the result data to geographically distributed, intermittently available end-users, often over under-provisioned connections. Such end-user data services are typically performed using point-to-point transfers that are designed for well-endowed sites; they are unable to reconcile the center's resource usage with users' delivery deadlines, unable to adapt to changing dynamics in the end-to-end data path, and not fault-tolerant. To overcome these inefficiencies, decentralized HPC data services are emerging as viable alternatives. In this paper, we develop and enhance such distributed data services by designing CATCH, a Cloud-based Adaptive data Transfer service for HPC. CATCH leverages a bevy of cloud storage resources to orchestrate a decentralized data transport with fail-over capabilities. Our results demonstrate that CATCH is a feasible approach, and can help improve the data transfer times at the HPC center by as much as 81.1% for typical HPC workloads.

Citations: 32

PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.70

M. Christen, O. Schenk, H. Burkhart

Citations: 333