{"title":"Gossamer: A Lightweight Approach to Using Multicore Machines","authors":"J. Roback, G. Andrews","doi":"10.1109/ICPP.2010.12","DOIUrl":"https://doi.org/10.1109/ICPP.2010.12","url":null,"abstract":"The key to performance improvements in the multi-core era is for software to utilize the available concurrency. This paper presents a lightweight programming framework called Gossamer that is easy to use, enables the solution of a broad range of parallel programming problems, and produces efficient code. Gossamer contains (1) a set of high-level annotations that one adds to a sequential program to specify concurrency and synchronization, (2) a source-to-source translator that produces an optimized program that uses our threading library, and (3) a run-time system that provides efficient threads and synchronization. Gossamer supports iterative and recursive parallelism, pipelined computations, domain decomposition, and MapReduce computations.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131751765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MemX: Virtualization of Cluster-Wide Memory","authors":"Umesh Deshpande, Beilan Wang, Shafee Haque, M. R. Hines, Kartik Gopalan","doi":"10.1109/ICPP.2010.74","DOIUrl":"https://doi.org/10.1109/ICPP.2010.74","url":null,"abstract":"We present MemX -- a distributed system that virtualizes cluster-wide memory to support data-intensive and large memory workloads in virtual machines (VMs). MemX provides a number of benefits in virtualized settings: (1) VM workloads that access large datasets can perform low-latency I/O over virtualized cluster-wide memory; (2) VMs can transparently execute very large memory applications that require more memory than physical DRAM present in the host machine; (3) MemX reduces the effective memory usage of the cluster by de-duplicating pages that have identical content; (4) existing applications do not require any modifications to benefit from MemX such as the use of special APIs, libraries, recompilation, or relinking; and (5) MemX supports live migration of large-footprint VMs by eliminating the need to migrate part of their memory footprint resident on other nodes. Detailed evaluations of our MemX prototype show that large dataset applications and multiple concurrent VMs achieve significant performance improvements using MemX compared against virtualized local and iSCSI disks.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128859456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Mobile Mules for Collecting Data from an Isolated Wireless Sensor Network","authors":"Y. Tseng, Wan-Ting Lai, Chi-Fu Huang, Fang-jing Wu","doi":"10.1109/ICPP.2010.75","DOIUrl":"https://doi.org/10.1109/ICPP.2010.75","url":null,"abstract":"This paper considers storage management in an isolated WSN, under the constraint that the storage space per node is limited. We formulate the memory spaces of these sensor nodes as a distributed storage system. Assuming that there is a sink in the WSN that will be visited by mobile mules intentionally (e.g., pre-arranged buses) or occasionally (e.g., non-pre-arranged taxis), we address three issues: (1) how to buffer sensory data to reduce data loss due to shortage of storage spaces, (2) if dropping of data is inevitable, how to avoid higher priority data from being dropped, and (3) how to keep higher priority data closer to the sink, such that the mobile mules can download more important data first when the downloading time is limited. We propose a Distributed Storage Management Strategy (DSMS) based on a novel shuffling mechanism similar to heap sort. It allows nodes to exchange sensory data with neighbors based on only local information. To the best of our knowledge, this is the first work addressing distributed and prioritized storing strategies for isolated WSNs.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127464424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs","authors":"Peng Di, Qing Wan, Xuemeng Zhang, Hui Wu, Jingling Xue","doi":"10.1109/ICPP.2010.13","DOIUrl":"https://doi.org/10.1109/ICPP.2010.13","url":null,"abstract":"To exploit the full potential of GPGPUs for general purpose computing, DOACR parallelism abundant in scientific and engineering applications must be harnessed. However, the presence of cross-iteration data dependences in DOACR loops poses an obstacle to execute their computations concurrently using a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACR parallelism to identify optimization principles and strategies that allow their efficient mapping to GPGPUs. Our main finding is that certain DOACR loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amendable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned by a performance-tuning tool. We substantiate this finding with a case study by presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained non-conventionally, by starting from a K-layer SSOR method and then parallelizing it by applying a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by making a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations and performance tuning in maximizing the performance of applications, particularly PDE-based DOACR loops, on GPGPUs.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116802127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microwiper: Efficient Memory Propagation in Live Migration of Virtual Machines","authors":"Yuyang Du, Hongliang Yu, G. Shi, Jing Chen, Weimin Zheng","doi":"10.1109/ICPP.2010.23","DOIUrl":"https://doi.org/10.1109/ICPP.2010.23","url":null,"abstract":"Live migration of virtual machines relocates running VM across physical hosts with unnoticeable service downtime. However, propagating changing VM memory at low cost, especially for write-intensive applications or at relatively low network bandwidth, is still uncovered. This paper presents Microwiper, an improvement of memory propagation in live migration. Our idea is twofold. We propose ordered propagation to transfer dirty memory pages according to their rewriting rates. We factor available network bandwidth in sending pages to throttle hot spot; after the accumulated rewriting rate exceeds the estimated bandwidth, next iteration is started immediately. The combination of these novel methods can not only reduce dirtied pages, but also shorten service downtime and total migration time. We implemented Microwiper by retrofitting the pre-copy approach in Xen hypervisor. We conducted detailed experiments to evaluate its efficacy on various workloads. The experimental results show that Microwiper can significantly reduce downtime and transferred pages by more than 50%. Microwiper has good adaptivity, and hence can be applied to other virtualization platforms easily.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127278453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalability of a Parallel JPEG Encoder on Shared Memory Architectures","authors":"David Castells-Rufas, Jaume Joven, J. Carrabina","doi":"10.1109/ICPP.2010.58","DOIUrl":"https://doi.org/10.1109/ICPP.2010.58","url":null,"abstract":"Embedded multimedia systems are expected to fully embrace the future many-core wave. As a consequence parallel programming is being revamped as the only way to exploit the power of coming chips. While waiting for them we try to extrapolate some lessons learned from current multi-cores to influence future architectures and programming methods. In this paper we investigate the parallelism and scalability of a JPEG image encoder, which is a typical embedded application, on several shared memory machines using the OpenMP programming framework. We identify the Huffman coding as the bottleneck that blocks the application from scaling above a 7x factor. We propose a strategy to parallelize the Huffman coding, which introduces a small degradation in some parts of the image, allowing to reach higher speedup factors. A factor of 18.8x has been reached in SGI Altix 4700 using 22 threads. Contrasting these results with some previous works using message passing architectures we consider that the use of OpenMP on top of shared memory architectures should be reconsidered for future chips in favor of message passing architectures and programming models.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121063018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Block-Parallel Programming for Real-Time Embedded Applications","authors":"D. Black-Schaffer, W. Dally","doi":"10.1109/ICPP.2010.37","DOIUrl":"https://doi.org/10.1109/ICPP.2010.37","url":null,"abstract":"Embedded media applications have traditionally used custom ASICs to meet their real-time performance requirements. However, the combination of increasing chip design cost and availability of commodity many-core processors is making programmable devices increasingly attractive alternatives. Yet for these processors to be successful in this role, programming systems are needed that can automate the task of mapping the applications to the tens-to-hundreds of cores on current and future many-core processors, while simultaneously guaranteeing the real-time throughput constraints. This paper presents a block-parallel program description for embedded real-time media applications and automatic transformations including buffering and parallelization to ensure the program meets the throughput requirements. These transformations are enabled by starting with a high-level, yet intuitive, application description. The description builds on traditional stream programming structures by adding simple control and serialization constructs to enable a greater variety of applications. The result is an application description that provides a balance of flexibility and power to the programmer, while exposing the application structure to the compiler at a high enough level to enable useful transformations without heroic analysis.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132662507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs","authors":"José L. Abellán, Juan Fernández, M. Acacio","doi":"10.1109/ICPP.2010.34","DOIUrl":"https://doi.org/10.1109/ICPP.2010.34","url":null,"abstract":"Barrier synchronization in shared memory parallel machines has been widely implemented through busy-waiting on shared variables. However, typical implementations of barrier synchronization tend to produce hot-spots in terms of memory and network contention, thus creating performance bottlenecks that become markedly more pronounced as the number of cores or processors increases. To overcome such limitations, we present a novel hardware-based barrier mechanism in the context of many-core CMPs. Our proposal is based on global interconnection lines (G-lines) and the S-CSMA technique, which have been recently used to enhance a flow control mechanism (EVC) in the context of networks-on-chip. Based on this technology, we have designed a simple and scalable G-line-based network that operates independently of the main data network, and that is aimed at carrying out barrier synchronizations efficiently. In the ideal case, our design takes only 4 cycles to perform a barrier synchronization once all cores or threads have arrived at the barrier. As a proof of concept, we examine the benefits of our proposal by comparing it with one of the best software approaches (a binary combining-tree barrier). To do so, we run several kernels and scientific applications on top of the Sim-PowerCMP performance simulator that models a 32-core CMP with a 2D-mesh network configuration. Our proposal entails average reductions in terms of execution time of 68% and 21% for kernels and scientific applications, respectively. Additionally, network traffic is also lowered by 74% and 18%, respectively.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"58 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113941105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hyperscalar: A Novel Dynamically Reconfigurable Multi-core Architecture","authors":"J. Chiu, Yu-Liang Chou, Po-Kai Chen","doi":"10.1109/ICPP.2010.35","DOIUrl":"https://doi.org/10.1109/ICPP.2010.35","url":null,"abstract":"This paper proposes a reconfigurable multi-core architecture, called hyperscalar that enables many scalar cores to be united dynamically as a larger superscalar processor to accelerate a thread. To accomplish this, we propose the virtual shared register files (VSRF) that allow the instructions of a thread executed in the united cores to logically face a uniform set of register files. We also propose the instruction analyzer (IA) with the capability of detecting and tagging the dependence information to the newly fetched instructions. According to the tags, instructions in the united cores can issue requests to obtain their remote operands via the VSRF. The reconfigurable feature of hyperscalar can cover a spectrum of workloads well, providing high single-thread performance when TLP is low and high throughput when TLP is high. Simulation results show that the a 8-core hyperscalar chip multiprocessor’s 2, 4, and 8-core-united configurations archive 94%, 90%, and 83% of the performance of the monolithic 2, 4, and 8-issue out-of-order superscalar processors with lower area costs and better support for software diversity.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"11 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115482379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of a Hybrid Parallel Performance Measurement System","authors":"A. Morris, A. Malony, S. Shende, K. Huck","doi":"10.1109/ICPP.2010.57","DOIUrl":"https://doi.org/10.1109/ICPP.2010.57","url":null,"abstract":"Modern parallel performance measurement systems collect performance information either through probes inserted in the application code or via statistical sampling. Probe-based techniques measure performance metrics directly using calls to a measurement library that execute as part of the application. In contrast, sampling-based systems interrupt program execution to sample metrics for statistical analysis of performance. Although both measurement approaches are represented by robust tool frameworks in the performance community, each has its strengths and weaknesses. In this paper, we investigate the creation of a hybrid measurement system, the goal being to exploit the strengths of both systems and mitigate their weaknesses. We show how such a system can be used to provide the application programmer with a more complete analysis of their application. Simple example and application codes are used to demonstrate its capabilities. We also show how the hybrid techniques can be combined to provide real cross-language performance evaluation of an uninstrumented run for mixed compiled/interpreted execution environments (e.g., Python and C/C++/Fortran).","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124815695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}