{"title":"Gossamer: A Lightweight Approach to Using Multicore Machines","authors":"J. Roback, G. Andrews","doi":"10.1109/ICPP.2010.12","DOIUrl":"https://doi.org/10.1109/ICPP.2010.12","url":null,"abstract":"The key to performance improvements in the multi-core era is for software to utilize the available concurrency. This paper presents a lightweight programming framework called Gossamer that is easy to use, enables the solution of a broad range of parallel programming problems, and produces efficient code. Gossamer contains (1) a set of high-level annotations that one adds to a sequential program to specify concurrency and synchronization, (2) a source-to-source translator that produces an optimized program that uses our threading library, and (3) a run-time system that provides efficient threads and synchronization. Gossamer supports iterative and recursive parallelism, pipelined computations, domain decomposition, and MapReduce computations.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131751765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MemX: Virtualization of Cluster-Wide Memory","authors":"Umesh Deshpande, Beilan Wang, Shafee Haque, M. R. Hines, Kartik Gopalan","doi":"10.1109/ICPP.2010.74","DOIUrl":"https://doi.org/10.1109/ICPP.2010.74","url":null,"abstract":"We present MemX -- a distributed system that virtualizes cluster-wide memory to support data-intensive and large memory workloads in virtual machines (VMs). MemX provides a number of benefits in virtualized settings: (1) VM workloads that access large datasets can perform low-latency I/O over virtualized cluster-wide memory; (2) VMs can transparently execute very large memory applications that require more memory than physical DRAM present in the host machine; (3) MemX reduces the effective memory usage of the cluster by de-duplicating pages that have identical content; (4) existing applications do not require any modifications to benefit from MemX such as the use of special APIs, libraries, recompilation, or relinking; and (5) MemX supports live migration of large-footprint VMs by eliminating the need to migrate part of their memory footprint resident on other nodes. Detailed evaluations of our MemX prototype show that large dataset applications and multiple concurrent VMs achieve significant performance improvements using MemX compared against virtualized local and iSCSI disks.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128859456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Mobile Mules for Collecting Data from an Isolated Wireless Sensor Network","authors":"Y. Tseng, Wan-Ting Lai, Chi-Fu Huang, Fang-jing Wu","doi":"10.1109/ICPP.2010.75","DOIUrl":"https://doi.org/10.1109/ICPP.2010.75","url":null,"abstract":"This paper considers storage management in an isolated WSN, under the constraint that the storage space per node is limited. We formulate the memory spaces of these sensor nodes as a distributed storage system. Assuming that there is a sink in the WSN that will be visited by mobile mules intentionally (e.g., pre-arranged buses) or occasionally (e.g., non-pre-arranged taxis), we address three issues: (1) how to buffer sensory data to reduce data loss due to shortage of storage spaces, (2) if dropping of data is inevitable, how to avoid higher priority data from being dropped, and (3) how to keep higher priority data closer to the sink, such that the mobile mules can download more important data first when the downloading time is limited. We propose a Distributed Storage Management Strategy (DSMS) based on a novel shuffling mechanism similar to heap sort. It allows nodes to exchange sensory data with neighbors based on only local information. To the best of our knowledge, this is the first work addressing distributed and prioritized storing strategies for isolated WSNs.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127464424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs","authors":"Peng Di, Qing Wan, Xuemeng Zhang, Hui Wu, Jingling Xue","doi":"10.1109/ICPP.2010.13","DOIUrl":"https://doi.org/10.1109/ICPP.2010.13","url":null,"abstract":"To exploit the full potential of GPGPUs for general purpose computing, DOACR parallelism abundant in scientific and engineering applications must be harnessed. However, the presence of cross-iteration data dependences in DOACR loops poses an obstacle to execute their computations concurrently using a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACR parallelism to identify optimization principles and strategies that allow their efficient mapping to GPGPUs. Our main finding is that certain DOACR loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amendable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned by a performance-tuning tool. We substantiate this finding with a case study by presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained non-conventionally, by starting from a K-layer SSOR method and then parallelizing it by applying a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by making a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations and performance tuning in maximizing the performance of applications, particularly PDE-based DOACR loops, on GPGPUs.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116802127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microwiper: Efficient Memory Propagation in Live Migration of Virtual Machines","authors":"Yuyang Du, Hongliang Yu, G. Shi, Jing Chen, Weimin Zheng","doi":"10.1109/ICPP.2010.23","DOIUrl":"https://doi.org/10.1109/ICPP.2010.23","url":null,"abstract":"Live migration of virtual machines relocates running VM across physical hosts with unnoticeable service downtime. However, propagating changing VM memory at low cost, especially for write-intensive applications or at relatively low network bandwidth, is still uncovered. This paper presents Microwiper, an improvement of memory propagation in live migration. Our idea is twofold. We propose ordered propagation to transfer dirty memory pages according to their rewriting rates. We factor available network bandwidth in sending pages to throttle hot spot; after the accumulated rewriting rate exceeds the estimated bandwidth, next iteration is started immediately. The combination of these novel methods can not only reduce dirtied pages, but also shorten service downtime and total migration time. We implemented Microwiper by retrofitting the pre-copy approach in Xen hypervisor. We conducted detailed experiments to evaluate its efficacy on various workloads. The experimental results show that Microwiper can significantly reduce downtime and transferred pages by more than 50%. Microwiper has good adaptivity, and hence can be applied to other virtualization platforms easily.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127278453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalability of a Parallel JPEG Encoder on Shared Memory Architectures","authors":"David Castells-Rufas, Jaume Joven, J. Carrabina","doi":"10.1109/ICPP.2010.58","DOIUrl":"https://doi.org/10.1109/ICPP.2010.58","url":null,"abstract":"Embedded multimedia systems are expected to fully embrace the future many-core wave. As a consequence parallel programming is being revamped as the only way to exploit the power of coming chips. While waiting for them we try to extrapolate some lessons learned from current multi-cores to influence future architectures and programming methods. In this paper we investigate the parallelism and scalability of a JPEG image encoder, which is a typical embedded application, on several shared memory machines using the OpenMP programming framework. We identify the Huffman coding as the bottleneck that blocks the application from scaling above a 7x factor. We propose a strategy to parallelize the Huffman coding, which introduces a small degradation in some parts of the image, allowing to reach higher speedup factors. A factor of 18.8x has been reached in SGI Altix 4700 using 22 threads. Contrasting these results with some previous works using message passing architectures we consider that the use of OpenMP on top of shared memory architectures should be reconsidered for future chips in favor of message passing architectures and programming models.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121063018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Block-Parallel Programming for Real-Time Embedded Applications","authors":"D. Black-Schaffer, W. Dally","doi":"10.1109/ICPP.2010.37","DOIUrl":"https://doi.org/10.1109/ICPP.2010.37","url":null,"abstract":"Embedded media applications have traditionally used custom ASICs to meet their real-time performance requirements. However, the combination of increasing chip design cost and availability of commodity many-core processors is making programmable devices increasingly attractive alternatives. Yet for these processors to be successful in this role, programming systems are needed that can automate the task of mapping the applications to the tens-to-hundreds of cores on current and future many-core processors, while simultaneously guaranteeing the real-time throughput constraints. This paper presents a block-parallel program description for embedded real-time media applications and automatic transformations including buffering and parallelization to ensure the program meets the throughput requirements. These transformations are enabled by starting with a high-level, yet intuitive, application description. The description builds on traditional stream programming structures by adding simple control and serialization constructs to enable a greater variety of applications. The result is an application description that provides a balance of flexibility and power to the programmer, while exposing the application structure to the compiler at a high enough level to enable useful transformations without heroic analysis.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132662507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs","authors":"José L. Abellán, Juan Fernández, M. Acacio","doi":"10.1109/ICPP.2010.34","DOIUrl":"https://doi.org/10.1109/ICPP.2010.34","url":null,"abstract":"Barrier synchronization in shared memory parallel machines has been widely implemented through busy-waiting on shared variables. However, typical implementations of barrier synchronization tend to produce hot-spots in terms of memory and network contention, thus creating performance bottlenecks that become markedly more pronounced as the number of cores or processors increases. To overcome such limitations, we present a novel hardware-based barrier mechanism in the context of many-core CMPs. Our proposal is based on global interconnection lines (G-lines) and the S-CSMA technique, which have been recently used to enhance a flow control mechanism (EVC) in the context of networks-on-chip. Based on this technology, we have designed a simple and scalable G-line-based network that operates independently of the main data network, and that is aimed at carrying out barrier synchronizations efficiently. In the ideal case, our design takes only 4 cycles to perform a barrier synchronization once all cores or threads have arrived at the barrier. As a proof of concept, we examine the benefits of our proposal by comparing it with one of the best software approaches (a binary combining-tree barrier). To do so, we run several kernels and scientific applications on top of the Sim-PowerCMP performance simulator that models a 32-core CMP with a 2D-mesh network configuration. Our proposal entails average reductions in terms of execution time of 68% and 21% for kernels and scientific applications, respectively. Additionally, network traffic is also lowered by 74% and 18%, respectively.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"58 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113941105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hyperscalar: A Novel Dynamically Reconfigurable Multi-core Architecture","authors":"J. Chiu, Yu-Liang Chou, Po-Kai Chen","doi":"10.1109/ICPP.2010.35","DOIUrl":"https://doi.org/10.1109/ICPP.2010.35","url":null,"abstract":"This paper proposes a reconfigurable multi-core architecture, called hyperscalar that enables many scalar cores to be united dynamically as a larger superscalar processor to accelerate a thread. To accomplish this, we propose the virtual shared register files (VSRF) that allow the instructions of a thread executed in the united cores to logically face a uniform set of register files. We also propose the instruction analyzer (IA) with the capability of detecting and tagging the dependence information to the newly fetched instructions. According to the tags, instructions in the united cores can issue requests to obtain their remote operands via the VSRF. The reconfigurable feature of hyperscalar can cover a spectrum of workloads well, providing high single-thread performance when TLP is low and high throughput when TLP is high. Simulation results show that the a 8-core hyperscalar chip multiprocessor’s 2, 4, and 8-core-united configurations archive 94%, 90%, and 83% of the performance of the monolithic 2, 4, and 8-issue out-of-order superscalar processors with lower area costs and better support for software diversity.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"11 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115482379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of a Hybrid Parallel Performance Measurement System","authors":"A. Morris, A. Malony, S. Shende, K. Huck","doi":"10.1109/ICPP.2010.57","DOIUrl":"https://doi.org/10.1109/ICPP.2010.57","url":null,"abstract":"Modern parallel performance measurement systems collect performance information either through probes inserted in the application code or via statistical sampling. Probe-based techniques measure performance metrics directly using calls to a measurement library that execute as part of the application. In contrast, sampling-based systems interrupt program execution to sample metrics for statistical analysis of performance. Although both measurement approaches are represented by robust tool frameworks in the performance community, each has its strengths and weaknesses. In this paper, we investigate the creation of a hybrid measurement system, the goal being to exploit the strengths of both systems and mitigate their weaknesses. We show how such a system can be used to provide the application programmer with a more complete analysis of their application. Simple example and application codes are used to demonstrate its capabilities. We also show how the hybrid techniques can be combined to provide real cross-language performance evaluation of an uninstrumented run for mixed compiled/interpreted execution environments (e.g., Python and C/C++/Fortran).","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124815695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}