{"title":"An Empirical Performance Study of Chapel Programming Language","authors":"N. Dun, K. Taura","doi":"10.1109/IPDPSW.2012.64","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.64","url":null,"abstract":"In this paper we evaluate the performance of the Chapel programming language from the perspective of its language primitives and features, where the microbenchmarks are synthesized from our lessons learned in developing molecular dynamics simulation programs in Chapel. Experimental results show that most language building blocks have comparable performance to corresponding hand-written C code, while the complex applications can achieve up to 70% of the performance of C implementation. We identify several causes of overhead that can be further optimized by Chapel compiler. This work not only helps Chapel users understand the performance implication of using Chapel, but also provides useful feedbacks for Chapel developers to make a better compiler.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134181416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimize Block-Level Cloud Storage System with Load-Balance Strategy","authors":"Li Zhou, Yicheng Wang, Jilin Zhang, Jian Wan, Yongjian Ren","doi":"10.1109/IPDPSW.2012.267","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.267","url":null,"abstract":"Cloud storage systems take advantage of distributed storage technology and virtualization technology, to provide virtual machine clients with customizable storage service. They can be divided into two types: distributed file system and block level storage system. Orthrus is a Light weighted Block-Level Cloud Storage System, which adopt multiple volume servers' architecture to avoid single-point problem in other solutions. However, how to make the servers load balance turn into a new problem appears in this architecture. In this paper we present a dynamic load balance strategy between multiple volume servers. We characterize machine capability and load quantity with black box modeling approach, and implement the load balance strategy based on genetic algorithm. Extensive experimental results show that the aggregated I/O throughputs of ORTHRUS are remarkably improved (about two times) with multiple volume servers, and both I/O throughputs and IOPS are remarkably improved (about 1.8 and 1.2 times respectively) by our dynamic load balance strategy.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134325245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication","authors":"S. Potluri, Hao Wang, Devendar Bureddy, A. Singh, C. Rosales, D. Panda","doi":"10.1109/IPDPSW.2012.228","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.228","url":null,"abstract":"Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in/out of GPUs continues to remain a major performance bottleneck. With CUDA 4.1, NVIDIA has introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves the programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes, taking advantage of IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and Active synchronization shows 74% improvement in latency for 4MByte message, compared to the existing Send/Receive based implementation. Our benchmark using Get and Passive Synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of Lattice Boltzmann Method for multiphase flows, by 16%, compared to the performance using existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node, using CUDA IPC.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134425090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated Parallelization of Computations and Visualization for Large-scale Applications","authors":"Preeti Malakar, V. Natarajan, Sathish S. Vadhiyar","doi":"10.1109/IPDPSW.2012.314","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.314","url":null,"abstract":"Critical applications like cyclone tracking and earthquake modeling require simultaneous high-performance simulations and online visualization for timely analysis. Faster simulations and simultaneous visualization enable scientists provide real-time guidance to decision makers. In this work, we have developed an integrated user-driven and automated steering framework that simultaneously performs numerical simulations and efficient online remote visualization of critical weather applications in resource-constrained environments. It considers application dynamics like the criticality of the application and resource dynamics like the storage space, network bandwidth and available number of processors to adapt various application and resource parameters like simulation resolution, simulation rate and the frequency of visualization. We formulate the problem of finding an optimal set of simulation parameters as a linear programming problem. This leads to 30% higher simulation rate and 25-50% lesser storage consumption than a naive greedy approach. The framework also provides the user control over various application parameters like region of interest and simulation resolution. We have also devised an adaptive algorithm to reduce the lag between the simulation and visualization times. Using experiments with different network bandwidths, we find that our adaptive algorithm is able to reduce lag as well as visualize the most representative frames.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128981221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incorporating the NSF/TCPP Curriculum Recommendations in a Liberal Arts Setting","authors":"Akshaye Dhawan","doi":"10.1109/IPDPSW.2012.165","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.165","url":null,"abstract":"This paper examines the integration of the NSF/TCPP Core Curriculum Recommendations in a liberal arts undergraduate setting. We examine how parallel and distributed computing concepts can be incorporated across the breadth of the undergraduate curriculum. As a model of such an integration, changes are proposed to Data Structures and Design and Analysis of Algorithms. These changes were implemented in Design and Analysis of Algorithms and the results were compared to previous iterations of that course taught by the same instructor. The student feedback received shows that the introduction of these topics made the course more engaging and conveyed an adequate introduction to this material.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122937159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Adaptive Heterogeneous Cluster with Wireless Network","authors":"Xinyu Niu, K. H. Tsoi, W. Luk","doi":"10.1109/IPDPSW.2012.37","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.37","url":null,"abstract":"The high performance computing (HPC) community has been exploring novel platforms to push performance forward. Field Programmable Logic Arrays (FPGAs) and Graphics Processing Units (GPUs) have been widely used as accelerators for computational intensive applications. Heterogeneous cluster is one of the promising platforms as it combines characteristics of multiple processing elements, to meet requirements of various applications. In this work, we build a self-adaptive framework for heterogeneous clusters, coupled with a customised wireless network. A runtime cluster model is implemented to predict throughput, power and thermal merits for heterogeneous clusters. Cluster configurations are scheduled to improve cluster power efficiency, as well as to reduce peak temperature of processing elements. Results show that, for monitoring operations upon heterogeneous clusters, the customised wireless network provides stable and scalable performance for negligible overhead. A high performance application is developed under the proposed framework. Experiments show that this approach can improve both power efficiency and energy efficiency of N-body simulation for more than 15 times, while reducing device peak temperature by up to 12° C.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124522785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication Library to Overlap Computation and Communication for OpenCL Application","authors":"T. Komoda, Shinobu Miwa, Hiroshi Nakamura","doi":"10.1109/IPDPSW.2012.68","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.68","url":null,"abstract":"User-friendly parallel programming environments, such as CUDA and OpenCL are widely used for accelerators. They provide programmers with useful APIs, but the APIs are still low level primitives. Therefore, in order to apply communication optimization techniques, such as double buffering techniques, programmers have to manually write the programs with the primitives. Manual communication optimization requires programmers to have significant knowledge of both application characteristics and CPU-accelerator architecture. This prevents many application developers from effective utilization of accelerators. In addition, managing communication is a tedious and error-prone task even for expert programmers. Thus, it is necessary to develop a communication system which is highly abstracted but still capable of optimization. For this purpose, this paper proposes an OpenCL based communication library. To maximize performance improvement, the proposed library provides a simple but effective programming interface based on Stream Graph in order to specify an applications communication pattern. We have implemented a prototype system on OpenCL platform and applied it to several image processing applications. Our evaluation shows that the library successfully masks the details of accelerator memory management while it can achieve comparable speedup to manual optimization in which we use existing low level interfaces.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121340038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 3N Approach to Network Control and Management","authors":"Feng Zhao, Dan Zhao, Xiaofeng Hu, Wei Peng, Baosheng Wang, Zexin Lu","doi":"10.1109/IPDPSW.2012.156","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.156","url":null,"abstract":"As the network technology and applications continue to evolve, computer networks become more and more important. However, network users can attack the network infrastructure (such as domain name service and routing services, etc.). The networks can not provide the minimum required quality of service for control. Network situation can not be aware in a timely manner. And network maintaining and upgrading are not easy. We argue that one root cause of these problems is that control, management and forwarding function are intertwined tightly. We advocate a complete loosing of the functionality and propose an extreme design point that we call \"3N\", after the architecture's three separated networks: forwarding network, control network and management network. Accordingly, we introduce four network entities: forwarder, controller, manager and separators. In the 3N architecture, the forwarding network mainly forwards packets at the behest of the control network and the management network, the control network mainly perform route computation for the data network, and the management network mainly learn about the situation of the data network and distribute policies and configurations, and the three networks working together to consist a efficient network system. In this paper we presented a high level overview of 3N architecture and some research considerations in its realization. We think the 3N architecture is helpful to improve network security, availability, manageability, scalability and so on.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117005230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel Resampling Algorithm for Particle Filtering on Shared-Memory Architectures","authors":"Peng Gong, Y. O. Basciftci, F. Özgüner","doi":"10.1109/IPDPSW.2012.184","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.184","url":null,"abstract":"Many real-world applications such as positioning, navigation, and target tracking for autonomous vehicles require the estimation of some time-varying states based on noisy measurements made on the system. Particle filters can be used when the system model and the measurement model are not Gaussian or linear. However, the computational complexity of particle filters prevents them from being widely adopted. Parallel implementation will make particle filters more feasible for real-time applications. Effective resampling algorithms like the systematic resampling algorithm are serial. In this paper, we propose the shared-memory systematic resampling (SMSR) algorithm that is easily parallelizable on existing architectures. We verify the performance of SMSR on graphics processing units. Experimental results show that the proposed SMSR algorithm can achieve a significant speedup over the serial particle filter.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115625878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication-Optimal Parallel N-body Solvers","authors":"Aparna Chandramowlishwaran, R. Vuduc","doi":"10.1109/IPDPSW.2012.303","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.303","url":null,"abstract":"We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. Our research specifically addresses two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high-performance in the face of multicore and many core systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. This analysis yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs-if there are no significant change-could cause it to become communication-bound as early as the year 2020. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116177755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}