{"title":"Direct N-body Kernels for Multicore Platforms","authors":"Nitin Arora, A. Shringarpure, R. Vuduc","doi":"10.1109/ICPP.2009.71","DOIUrl":"https://doi.org/10.1109/ICPP.2009.71","url":null,"abstract":"We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDIA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121087743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing Equilibria in Bimatrix Games by Parallel Vertex Enumeration","authors":"J. Widger, Daniel Grosu","doi":"10.1109/ICPP.2009.11","DOIUrl":"https://doi.org/10.1109/ICPP.2009.11","url":null,"abstract":"Equilibria computation is of great importance to many areas such as economics, control theory, and recently computer science. We focus on the computation of Nash equilibria in two-player general-sum normal form games, also called bimatrix games. One efficient method to compute these equilibria is based on enumerating the vertices of the best response polyhedrons of the two players and checking the equilibrium conditions for every pair of vertices. We design and implement a parallel algorithm for computing Nash equilibria in bimatrix games based on vertex enumeration. We analyze the performance of the proposed algorithm by performing extensive experiments on a grid computing system.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122099436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the Cost-Availability Tradeoff in P2P Storage Systems","authors":"Zhi Yang, Yafei Dai, Zhen Xiao","doi":"10.1109/ICPP.2009.46","DOIUrl":"https://doi.org/10.1109/ICPP.2009.46","url":null,"abstract":"P2P storage systems use replication to provide a certain level of availability. While the system must generate new replicas to replace replicas lost to permanent failures, it can save significant replication cost by not replicating following transient failures. However, in real systems, it is impossible to reliably distinguish permanent from transient failures, resulting in a tradeoff between high recovery cost and low data availability. In this paper, we analyze the use of timeouts as a mechanism to navigate this tradeoff. We address the challenging problem of how to choose a timeout to walk the fine line between causing unnecessary replication due to detection inaccuracy, and reducing availability due to detection delay. We conduct simulations based on both synthetic and real traces, and show that the performance of our selected timeout closely approximates the optimal performance that can be achieved by timeouts, and even that of an “oracle” failure detector.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130553680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic-Based Robust Dynamic Resource Allocation in a Heterogeneous Computing System","authors":"Jay Smith, E. Chong, A. A. Maciejewski, H. Siegel","doi":"10.1109/ICPP.2009.45","DOIUrl":"https://doi.org/10.1109/ICPP.2009.45","url":null,"abstract":"This research investigates the problem of robust dynamic resource allocation for heterogeneous distributed computing systems operating under imposed constraints. Often, such systems are expected to function in an environment where uncertainty in system parameters is common. In such an environment, the amount of processing required to complete an application may fluctuate substantially. Determining a resource allocation that accounts for this uncertainty---in a way that can provide a probability that a given level of service is achieved---is an important area of research. We define a mathematical model of stochastic robustness appropriate for a dynamic environment that can be used during resource allocation to aid heuristic decision making. In addition, we design a novel technique for maximizing stochastic robustness in this environment. Our performance results for this technique are compared with several well known resource allocation techniques in a simulated environment that models a heterogeneous distributed computing system.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"98 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127890920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms","authors":"S. Alam, R. Barrett, J. Kuehn, Steve Poole","doi":"10.1109/ICPP.2009.51","DOIUrl":"https://doi.org/10.1109/ICPP.2009.51","url":null,"abstract":"The building blocks of emerging Petascale massively parallel processing (MPP) systems are multi-core processors with four or more cores as a single processing element and a customized network interface. The memory and communication hierarchies of these platforms are now exposed to application developers and end users through a hierarchical, multi-core-aware message-passing (MPI) programming interface and a handful of tunable runtime parameters that allow mapping and control of MPI tasks and message handling. We characterize the performance of MPI communication patterns and present strategies for optimizing application performance on Cray XT series systems, which are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in their memory and network subsystems that could influence production-level application performance. We demonstrate that MPI micro-benchmarks can mislead an application developer or end user, since these benchmarks often do not expose the interplay between memory allocation and usage in user space, which depends on the number of tasks or cores and on workload characteristics. Our studies show performance improvements over the default options for our target scientific benchmarks and production-level applications.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125667825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mediacoop: Hierarchical Lookup for P2P-VoD Services","authors":"Tieying Zhang, Jianming Lv, Xueqi Cheng","doi":"10.1109/ICPP.2009.58","DOIUrl":"https://doi.org/10.1109/ICPP.2009.58","url":null,"abstract":"Random seeking in P2P-VoD systems requires efficient lookup for “good” suppliers. The main challenge is that good suppliers should meet two requirements, “content match” and “quality match”, while most existing methods focus on only one aspect. In this paper, we propose Mediacoop, a novel structured lookup method combining both content and quality match to provide random seeking for P2P-VoD services. It exploits playpoint distance to efficiently locate candidate suppliers with the required data (content match), and performs a refined lookup within the candidates to meet quality match. Theoretical analysis and simulations show that Mediacoop outperforms the traditional methods. Our real-world system also proves the effectiveness of the design.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125721939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GePSeA: A General-Purpose Software Acceleration Framework for Lightweight Task Offloading","authors":"Ajeet Singh, P. Balaji, W. Feng","doi":"10.1109/ICPP.2009.39","DOIUrl":"https://doi.org/10.1109/ICPP.2009.39","url":null,"abstract":"Hardware-acceleration techniques continue to be used to speed up the execution of scientific codes. To do so, software developers identify portions of these codes that are amenable to offloading and map them to hardware accelerators. However, offloading such tasks to specialized hardware accelerators is non-trivial. Furthermore, these accelerators can add significant cost to a computing system. Consequently, we propose a framework called GePSeA (General Purpose Software Acceleration Framework), which uses a small fraction of the computational power on multi-core architectures to ``onload'' complex application-specific tasks. Specifically, GePSeA provides a lightweight process that acts as a helper agent to the application by executing application-specific tasks asynchronously and efficiently. We then apply the GePSeA framework to a real application, namely, an open-source computational biology application, and demonstrate significant application-level benefits.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134424487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Isosurface Extraction for Medical Volume Dataset on Cell BE","authors":"Hai Jin, Bo Li, Ran Zheng, Qin Zhang","doi":"10.1109/ICPP.2009.47","DOIUrl":"https://doi.org/10.1109/ICPP.2009.47","url":null,"abstract":"The size of volumetric data generated by medical imaging and scientific simulations has increased significantly due to dramatic advances in medical imaging modalities and computing technologies. Such volumetric data generally need to be visualized, and the Marching Cubes algorithm (MC for short) is one of the standard methods of isosurface extraction for medical applications. However, the MC algorithm requires a large amount of computing power. The Cell Broadband Engine (Cell for short) processor, a typical COTS (commodity off-the-shelf) heterogeneous processor designed to handle extremely demanding computations, can be used to accelerate isosurface extraction in medical applications. In this paper, we present a streaming-model-based scheme to efficiently map the MC algorithm to Cell. Specifically, a block-based filter running on the PPE serves as a preprocessing stage to avoid unnecessary data transfer and computation, and the MC kernel runs on the SPEs as the subsequent stage. By tuning the block size, the workloads of the PPE and SPEs are balanced. The experimental results demonstrate an overall isosurface extraction speedup of more than 10 times compared with conventional CPUs.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131655819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel Algorithm for Computing Betweenness Centrality","authors":"Guangming Tan, Dengbiao Tu, Ninghui Sun","doi":"10.1109/ICPP.2009.53","DOIUrl":"https://doi.org/10.1109/ICPP.2009.53","url":null,"abstract":"In this paper we present a multi-grained parallel algorithm for computing betweenness centrality, which is extensively used in large-scale network analysis. Our method is based on a novel algorithmic handling of access conflicts for a CREW PRAM algorithm. We propose a proper data-processor mapping, a novel edge-numbering strategy and a new triple array data structure recording the shortest path for eliminating conflicts to access the shared memory. The algorithm requires $O(n+m)$ space and $O(\\frac{nm}{p})$ (or $O(\\frac{nm+n^{2}\\log n}{p})$) time for unweighted (or weighted) graphs, and it is a work-optimal CREW PRAM algorithm. On current multi-core platforms, our algorithm outperforms the previous algorithm by 2-3 times.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132843410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Analysis of DHT Algorithms for Range-Query and Multi-Attribute Resource Discovery in Grids","authors":"Haiying Shen, Chengzhong Xu","doi":"10.1109/ICPP.2009.37","DOIUrl":"https://doi.org/10.1109/ICPP.2009.37","url":null,"abstract":"Resource discovery is critical to the usability and accessibility of grid computing systems. The Distributed Hash Table (DHT) has been applied to grid systems as a distributed mechanism for providing scalable range-query and multi-attribute resource discovery. Multi-DHT-based approaches depend on multiple DHT networks, with each network responsible for a single attribute. Single-DHT-based approaches keep the resource information of all attributes in a single node. Both classes of approaches lead to high overhead. Recently, we proposed a heuristic Low-Overhead Range-query Multi-attribute DHT-based resource discovery approach (LORM). It relies on a single hierarchical DHT network and distributes resource information among nodes in a balanced manner by taking advantage of the hierarchical structure. We demonstrated its effectiveness and efficiency via simulation. In this paper, we analyze the performance of the LORM approach rigorously by comparing it with other multi-DHT-based and single-DHT-based approaches with respect to their overhead and efficiency. The analytical results are consistent with the simulation results. The results prove the superiority of the LORM approach in theory.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133041890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}