{"title":"The fresh breeze project: A multi-core chip supporting composable parallel programming","authors":"J. Dennis","doi":"10.1109/IPDPS.2008.4536391","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536391","url":null,"abstract":"The Fresh Breeze project concerns the architecture and design of a multicore chip that can achieve superior performance while supporting composability of parallel programs. The requirements of composability imply that the management of processor allocation and memory management must be sufficiently flexible to permit reassignment of resources according to the current needs of computations. The Fresh Breeze programming model combines the spawn/join threading model of Cilk with a write-once memory model based on fixed-size chunks that are allocated and freed by efficient hardware mechanisms. This model supports computing jobs by many users, each consisting of a hierarchy of function activations. The model satisfies all six principles for supporting modular program construction. Within this programming model, it is possible for any parallel program to be used, without change, as a component in building larger parallel programs.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123565947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Early experiences in application level I/O tracing on blue gene systems","authors":"Seetharami R. Seelam, I. Chung, Ding-Yong Hong, H. Wen, Hao Yu","doi":"10.1109/IPDPS.2008.4536550","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536550","url":null,"abstract":"On today's massively parallel processing (MPP) supercomputers, it is increasingly important to understand I/O performance of an application both to guide scalable application development and to tune its performance. These two critical steps are often enabled by performance analysis tools to obtain performance data on thousands of processors in an MPP system. To this end, we present the design, implementation, and early experiences of an application level I/O tracing library and the corresponding tool for analyzing and optimizing I/O performance on Blue Gene (BG) MPP systems. This effort was a part of IBM UPC Toolkit for BG systems. To our knowledge, this is the first comprehensive application-level I/O monitoring, playback, and optimizing tool available on BG systems. The preliminary experiments on the popular NPB BTIO benchmark show that the tool is very useful in facilitating detailed I/O performance analysis.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125323267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data mining on the grid for the grid","authors":"N. Chawla, D. Thain, Ryan Lichtenwalter, David A. Cieslak","doi":"10.1109/IPDPS.2008.4536427","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536427","url":null,"abstract":"Both users and administrators of computing grids are presented with enormous challenges in debugging and troubleshooting. Diagnosing a problem with one application on one machine is hard enough, but diagnosing problems in workloads of millions of jobs running on thousands of machines is a problem of a new order of magnitude. Suppose that a user submits one million jobs to a grid, only to discover some time later that half of them have failed. Users of large scale systems need tools that describe the overall situation, indicating what problems are commonplace versus occasional, and which are deterministic versus random. Machine learning techniques can be used to debug these kinds of problems in large scale systems. We present a comprehensive framework from data to knowledge discovery as an important step towards achieving this vision.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125549948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault tolerance with shortest paths in regular and irregular networks","authors":"F. Sem-Jacobsen, Olav Lysne","doi":"10.1109/IPDPS.2008.4536280","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536280","url":null,"abstract":"Fault tolerance has become an important part of current supercomputers. Local dynamic fault tolerance is the most expedient way of tolerating faults by preconfiguring the network with multiple paths from every node/switch to every destination. In this paper we present a local shortest path dynamic fault-tolerance mechanism inspired by a solution developed for the Internet, that can be applied to any shortest path routing algorithm such as dimension ordered routing, fat tree routing, layered shortest path, etc., and provide a solution for achieving deadlock freedom in the presence of faults. Simulation results show that 1) for fat trees this yields the highest throughput to date and the lowest requirements on virtual layers for dynamic one-fault tolerance, 2) we require in general few layers to achieve deadlock freedom, and 3) for irregular topologies it gives up to a 10 times performance increase compared to FRoots.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126874387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic construction of coordinated performance skeletons","authors":"J. Subhlok, Qiang Xu","doi":"10.1109/IPDPS.2008.4536405","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536405","url":null,"abstract":"Performance prediction is particularly challenging for dynamic and unpredictable environments that cannot be modeled well, such as execution with sharing of CPU and bandwidth resources. Our approach to performance estimation in such scenarios is based on actual execution of short running customized performance skeletons for target applications. This work focuses on automatic construction of performance skeletons for parallel MPI programs. Logicalization of a family of traces to a single trace is presented as a key technique for skeleton construction. Compression of communication traces is achieved by identifying the loop structure from traces. Results are presented that demonstrate that logicalization and compression are accurate and efficient. Automatically constructed performance skeletons were able to effectively predict application performance in a variety of scenarios involving resource sharing and changes in the execution environment.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115025509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and implementation of Open-MX: High-performance message passing over generic Ethernet hardware","authors":"Brice Goglin","doi":"10.1109/IPDPS.2008.4536140","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536140","url":null,"abstract":"Open-MX is a new message passing layer implemented on top of the generic Ethernet stack of the Linux kernel. It provides high-performance communication on top of any Ethernet hardware while exhibiting the Myrinet Express application interface. Open-MX also enables wire-interoperability with Myricom's MXoE hosts. This article presents the design of the Open-MX stack which reproduces the MX firmware in a Linux driver. MPICH-MX and PVFS2 layers are already able to work flawlessly on Open-MX. The first performance evaluation shows interesting latency and bandwidth results on 1 and 10 gigabit hardware.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116440680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogenous dating service with application to rumor spreading","authors":"Olivier Beaumont, Philippe Duchon, M. Korzeniowski","doi":"10.1109/IPDPS.2008.4536294","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536294","url":null,"abstract":"Peer-to-peer overlay networks have proven their efficiency for storing and retrieving data at large scale, but new services are required to take the actual performance of resources into account. In this paper, we describe a fully decentralized algorithm, called \"dating service\", meant to organize communications in a fully heterogeneous network, that ensures that communication capabilities of the nodes are not exceeded. We prove that with high probability, this service ensures that a constant fraction of all possible communications is organized. Interestingly enough, this property holds true even if a node is not able to choose another node uniformly at random. In particular, the dating service can be implemented over existing DHT-based systems. In order to illustrate the expressiveness and the usefulness of the proposed service, we also present a possible practical application of the dating service. As an illustration, we propose an algorithm for rumor spreading that broadcasts a unit-size message to all the nodes of a P2P system in a logarithmic number of steps with high probability.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"370 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116563449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality aware MPI communication on a commodity opto-electronic hybrid network","authors":"Shin'ichiro Takizawa, Toshio Endo, S. Matsuoka","doi":"10.1109/IPDPS.2008.4536343","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536343","url":null,"abstract":"Future supercomputers with millions of processors would pose significant challenges in their interconnection networks due to difficulty in design constraints such as space, cable length, cost, power consumption, etc. Instead of huge switches or bisection bandwidth restricted topologies such as a torus, we propose a network which utilizes both fully-connected lower-bandwidth electronic packet switching (EPS) network and low-power optical circuit switching (OCS) network. Optical circuits, connected sparingly to only a limited set of nodes to conserve power and cost, are used in a supplemental fashion as \"shortcut\" routes only when a node communicates substantially across EPS switches, while short latency communication is handled by EPS only. Our MPI inter-node communication algorithm accommodates for such a network by appropriate scheduling of nodes according to application communication patterns, in particular utilizing relatively high EPS local switch bandwidth to forward messages to nodes with optical connections for shortcutting in order to maximize overall throughput. Simulation studies confirm that our proposal effectively avoids contentions in the network in high-bandwidth applications with nominal additions of optical circuitry to existing machines.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122794440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-organized forensic support in MANETs","authors":"Xiwei Zhao, V. Ganapathy, N. Pissinou, K. Makki","doi":"10.1109/IPDPS.2008.4536127","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536127","url":null,"abstract":"A distributed hash table (DHT) based approach for supporting forensic capability in mobile ad hoc networks (MANETs) is presented. The DHT-based approach has been modified to inhibit recursive increase in bandwidth consumption due to forensic activity - the process of logging is associated with that of packet delivery via customizable decreasing functions. Simulation has revealed that this approach limits the bandwidth requirement for forensic activities, although it requires a trade-off between bandwidth consumption and effective logging. The focus is to design a self-organized logging system over networks with dynamic topology.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117041181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault tolerant scheduling of precedence task graphs on heterogeneous platforms","authors":"A. Benoit, M. Hakem, Y. Robert","doi":"10.1109/IPDPS.2008.4536133","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536133","url":null,"abstract":"Fault tolerance and latency are important requirements in several applications which are time critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme, capable of supporting ε arbitrary fail-silent (fail-stop) processor failures, hence valid results will be provided even if ε processors fail. We focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Major achievements include a low complexity, and a drastic reduction of the number of additional communications induced by the replication mechanism. Experimental results demonstrate that our heuristics, despite their lower complexity, outperform their direct competitor, the FTBAR scheduling algorithm [3].","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117105455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}