{"title":"An Empirical Performance Study of Chapel Programming Language","authors":"N. Dun, K. Taura","doi":"10.1109/IPDPSW.2012.64","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.64","url":null,"abstract":"In this paper we evaluate the performance of the Chapel programming language from the perspective of its language primitives and features, where the microbenchmarks are synthesized from our lessons learned in developing molecular dynamics simulation programs in Chapel. Experimental results show that most language building blocks have comparable performance to corresponding hand-written C code, while the complex applications can achieve up to 70% of the performance of C implementation. We identify several causes of overhead that can be further optimized by Chapel compiler. This work not only helps Chapel users understand the performance implication of using Chapel, but also provides useful feedbacks for Chapel developers to make a better compiler.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134181416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimize Block-Level Cloud Storage System with Load-Balance Strategy","authors":"Li Zhou, Yicheng Wang, Jilin Zhang, Jian Wan, Yongjian Ren","doi":"10.1109/IPDPSW.2012.267","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.267","url":null,"abstract":"Cloud storage systems take advantage of distributed storage technology and virtualization technology, to provide virtual machine clients with customizable storage service. They can be divided into two types: distributed file system and block level storage system. Orthrus is a Light weighted Block-Level Cloud Storage System, which adopt multiple volume servers' architecture to avoid single-point problem in other solutions. However, how to make the servers load balance turn into a new problem appears in this architecture. In this paper we present a dynamic load balance strategy between multiple volume servers. We characterize machine capability and load quantity with black box modeling approach, and implement the load balance strategy based on genetic algorithm. Extensive experimental results show that the aggregated I/O throughputs of ORTHRUS are remarkably improved (about two times) with multiple volume servers, and both I/O throughputs and IOPS are remarkably improved (about 1.8 and 1.2 times respectively) by our dynamic load balance strategy.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134325245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication","authors":"S. Potluri, Hao Wang, Devendar Bureddy, A. Singh, C. Rosales, D. Panda","doi":"10.1109/IPDPSW.2012.228","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.228","url":null,"abstract":"Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in/out of GPUs continues to remain a major performance bottleneck. With CUDA 4.1, NVIDIA has introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves the programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes, taking advantage of IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and Active synchronization shows 74% improvement in latency for 4MByte message, compared to the existing Send/Receive based implementation. Our benchmark using Get and Passive Synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of Lattice Boltzmann Method for multiphase flows, by 16%, compared to the performance using existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node, using CUDA IPC.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134425090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated Parallelization of Computations and Visualization for Large-scale Applications","authors":"Preeti Malakar, V. Natarajan, Sathish S. Vadhiyar","doi":"10.1109/IPDPSW.2012.314","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.314","url":null,"abstract":"Critical applications like cyclone tracking and earthquake modeling require simultaneous high-performance simulations and online visualization for timely analysis. Faster simulations and simultaneous visualization enable scientists provide real-time guidance to decision makers. In this work, we have developed an integrated user-driven and automated steering framework that simultaneously performs numerical simulations and efficient online remote visualization of critical weather applications in resource-constrained environments. It considers application dynamics like the criticality of the application and resource dynamics like the storage space, network bandwidth and available number of processors to adapt various application and resource parameters like simulation resolution, simulation rate and the frequency of visualization. We formulate the problem of finding an optimal set of simulation parameters as a linear programming problem. This leads to 30% higher simulation rate and 25-50% lesser storage consumption than a naive greedy approach. The framework also provides the user control over various application parameters like region of interest and simulation resolution. We have also devised an adaptive algorithm to reduce the lag between the simulation and visualization times. Using experiments with different network bandwidths, we find that our adaptive algorithm is able to reduce lag as well as visualize the most representative frames.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128981221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incorporating the NSF/TCPP Curriculum Recommendations in a Liberal Arts Setting","authors":"Akshaye Dhawan","doi":"10.1109/IPDPSW.2012.165","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.165","url":null,"abstract":"This paper examines the integration of the NSF/TCPP Core Curriculum Recommendations in a liberal arts undergraduate setting. We examine how parallel and distributed computing concepts can be incorporated across the breadth of the undergraduate curriculum. As a model of such an integration, changes are proposed to Data Structures and Design and Analysis of Algorithms. These changes were implemented in Design and Analysis of Algorithms and the results were compared to previous iterations of that course taught by the same instructor. The student feedback received shows that the introduction of these topics made the course more engaging and conveyed an adequate introduction to this material.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122937159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Adaptive Heterogeneous Cluster with Wireless Network","authors":"Xinyu Niu, K. H. Tsoi, W. Luk","doi":"10.1109/IPDPSW.2012.37","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.37","url":null,"abstract":"The high performance computing (HPC) community has been exploring novel platforms to push performance forward. Field Programmable Logic Arrays (FPGAs) and Graphics Processing Units (GPUs) have been widely used as accelerators for computational intensive applications. Heterogeneous cluster is one of the promising platforms as it combines characteristics of multiple processing elements, to meet requirements of various applications. In this work, we build a self-adaptive framework for heterogeneous clusters, coupled with a customised wireless network. A runtime cluster model is implemented to predict throughput, power and thermal merits for heterogeneous clusters. Cluster configurations are scheduled to improve cluster power efficiency, as well as to reduce peak temperature of processing elements. Results show that, for monitoring operations upon heterogeneous clusters, the customised wireless network provides stable and scalable performance for negligible overhead. A high performance application is developed under the proposed framework. Experiments show that this approach can improve both power efficiency and energy efficiency of N-body simulation for more than 15 times, while reducing device peak temperature by up to 12° C.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124522785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication Library to Overlap Computation and Communication for OpenCL Application","authors":"T. Komoda, Shinobu Miwa, Hiroshi Nakamura","doi":"10.1109/IPDPSW.2012.68","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.68","url":null,"abstract":"User-friendly parallel programming environments, such as CUDA and OpenCL are widely used for accelerators. They provide programmers with useful APIs, but the APIs are still low level primitives. Therefore, in order to apply communication optimization techniques, such as double buffering techniques, programmers have to manually write the programs with the primitives. Manual communication optimization requires programmers to have significant knowledge of both application characteristics and CPU-accelerator architecture. This prevents many application developers from effective utilization of accelerators. In addition, managing communication is a tedious and error-prone task even for expert programmers. Thus, it is necessary to develop a communication system which is highly abstracted but still capable of optimization. For this purpose, this paper proposes an OpenCL based communication library. To maximize performance improvement, the proposed library provides a simple but effective programming interface based on Stream Graph in order to specify an applications communication pattern. We have implemented a prototype system on OpenCL platform and applied it to several image processing applications. Our evaluation shows that the library successfully masks the details of accelerator memory management while it can achieve comparable speedup to manual optimization in which we use existing low level interfaces.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121340038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 3N Approach to Network Control and Management","authors":"Feng Zhao, Dan Zhao, Xiaofeng Hu, Wei Peng, Baosheng Wang, Zexin Lu","doi":"10.1109/IPDPSW.2012.156","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.156","url":null,"abstract":"As the network technology and applications continue to evolve, computer networks become more and more important. However, network users can attack the network infrastructure (such as domain name service and routing services, etc.). The networks can not provide the minimum required quality of service for control. Network situation can not be aware in a timely manner. And network maintaining and upgrading are not easy. We argue that one root cause of these problems is that control, management and forwarding function are intertwined tightly. We advocate a complete loosing of the functionality and propose an extreme design point that we call \"3N\", after the architecture's three separated networks: forwarding network, control network and management network. Accordingly, we introduce four network entities: forwarder, controller, manager and separators. In the 3N architecture, the forwarding network mainly forwards packets at the behest of the control network and the management network, the control network mainly perform route computation for the data network, and the management network mainly learn about the situation of the data network and distribute policies and configurations, and the three networks working together to consist a efficient network system. In this paper we presented a high level overview of 3N architecture and some research considerations in its realization. We think the 3N architecture is helpful to improve network security, availability, manageability, scalability and so on.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117005230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel Resampling Algorithm for Particle Filtering on Shared-Memory Architectures","authors":"Peng Gong, Y. O. Basciftci, F. Özgüner","doi":"10.1109/IPDPSW.2012.184","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.184","url":null,"abstract":"Many real-world applications such as positioning, navigation, and target tracking for autonomous vehicles require the estimation of some time-varying states based on noisy measurements made on the system. Particle filters can be used when the system model and the measurement model are not Gaussian or linear. However, the computational complexity of particle filters prevents them from being widely adopted. Parallel implementation will make particle filters more feasible for real-time applications. Effective resampling algorithms like the systematic resampling algorithm are serial. In this paper, we propose the shared-memory systematic resampling (SMSR) algorithm that is easily parallelizable on existing architectures. We verify the performance of SMSR on graphics processing units. Experimental results show that the proposed SMSR algorithm can achieve a significant speedup over the serial particle filter.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115625878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication-Optimal Parallel N-body Solvers","authors":"Aparna Chandramowlishwaran, R. Vuduc","doi":"10.1109/IPDPSW.2012.303","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.303","url":null,"abstract":"We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. Our research specifically addresses two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high-performance in the face of multicore and many core systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. This analysis yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs-if there are no significant change-could cause it to become communication-bound as early as the year 2020. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116177755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}