{"title":"Constructing Gene Regulatory Networks on Clusters of Cell Processors","authors":"J. Zola, Abhinav Sarje, S. Aluru","doi":"10.1109/ICPP.2009.35","DOIUrl":"https://doi.org/10.1109/ICPP.2009.35","url":null,"abstract":"Constructing genome-wide gene regulatory networks from a large number of gene expression profile measurements is an important problem in systems biology. While several techniques have been developed, none of them is parallel, and they lack the capability to scale to the whole-genome level or incorporate the largest data sets, particularly with rigorous statistical testing. To address this problem, we recently developed a mutual information theory based parallel method for gene network reconstruction. In this paper, we extend this work to a cluster of Cell processors. We use parallelization across multiple Cells, multiple cores within each Cell, and vector units within the cores to develop a high performance implementation that effectively addresses the scaling problem. We present experimental results comparing the Cell implementation with a standard uniprocessor implementation and an implementation on a conventional supercomputer. Finally, we report the construction of a large 15,203 gene network of the plant Arabidopsis thaliana from 2,996 microarray experiments on an 8-node Cell blade cluster in 2 hours and 24 minutes.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133213092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated Performance Views in Charm++: Projections Meets TAU","authors":"Scott Biersdorff, Chee Wai Lee, A. Malony, L. Kalé","doi":"10.1109/ICPP.2009.49","DOIUrl":"https://doi.org/10.1109/ICPP.2009.49","url":null,"abstract":"The Charm++ parallel programming system provides a modular performance interface that can be used to extend its performance measurement and analysis capabilities. The interface exposes execution events of interest representing Charm++ scheduling operations, application methods/routines, and communication events for observation by alternative performance modules configured to implement different measurement features. The paper describes the Charm++'s performance interface and how the Charm++ Projections tool and the TAU Performance System can provide integrated trace-based and profile-based performance views. These two tools are complementary, providing the user with different performance perspectives on Charm++ applications based on performance data detail and temporal and spatial analysis. How the tools work in practice is demonstrated in a parallel performance analysis of NAMD, a scalable molecular dynamics code that applies many of Charm++'s unique features.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128844679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Scalability of Parallel Verilog Simulation","authors":"S. Meraji, Wei Zhang, C. Tropper","doi":"10.1109/ICPP.2009.9","DOIUrl":"https://doi.org/10.1109/ICPP.2009.9","url":null,"abstract":"As a consequence of Moore’s law, the size of integrated circuits has grown extensively, resulting in simulation becoming the major bottleneck in the circuit design process. Consequently, parallel simulation has emerged as an approach which can be both fast and cost effective. In this paper, we examine the performance of a parallel Verilog simulator on four large, real designs. As previous work has made use of either relatively small benchmarks or synthetic circuits, the use of these circuits is far more realistic. We develop a parser for Verilog files enabling us to simulate in parallel all synthesizable Verilog circuits. We utilize four circuits as our test benches: the LEON processor with 200k gates, the OpenSparc T2 processor with 400k gates, and two Viterbi decoder circuits with 100k and 800k gates respectively. The simulator makes use of XTW and, to our knowledge, is the first Verilog simulator which can parse all synthesizable Verilog files. We observed 4,000,000 events per second on 32 processors for the Viterbi decoder with 800k gates. The simulator’s performance was shown to be scalable.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115499562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis","authors":"Darius Buntinas, Brice Goglin, David Goodell, Guillaume Mercier, Stéphanie Moreaud","doi":"10.1109/ICPP.2009.22","DOIUrl":"https://doi.org/10.1109/ICPP.2009.22","url":null,"abstract":"The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128169567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Distributed Three-hop Routing Protocol to Increase the Capacity of Hybrid Networks","authors":"Ze Li, Haiying Shen","doi":"10.1109/ICPP.2009.36","DOIUrl":"https://doi.org/10.1109/ICPP.2009.36","url":null,"abstract":"Hybrid wireless networks, which combine the advantages of both ad-hoc networks and infrastructure wireless networks, have been receiving increasing attention because of their ultra-high performance. An efficient data routing protocol is an important component in such networks for high capacity and scalability. However, most routing protocols for these networks simply combine an ad-hoc transmission mode with a cellular transmission mode, and thus fail to take advantage of the dual-feature architecture. This paper presents a distributed Three-hop Routing (DTR) protocol for hybrid wireless networks. DTR divides a message data stream into segments and transmits the segments in a distributed manner. It makes full spatial reuse of the system via the high-speed ad-hoc interface and alleviates mobile gateway congestion via the cellular interface. Furthermore, sending segments to a number of base stations simultaneously increases the throughput and makes full use of widespread base stations. In addition, DTR significantly reduces overhead due to short path length and eliminates route discovery and maintenance overhead. Theoretical analysis and simulation results show the superiority of DTR in comparison with other routing protocols in terms of throughput capacity, scalability and mobility resilience.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127435513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Communication Scheduling Using Dataflow Semantics","authors":"Adrian Soviani, J. Singh","doi":"10.1109/ICPP.2009.66","DOIUrl":"https://doi.org/10.1109/ICPP.2009.66","url":null,"abstract":"We show how coarse grain dataflow semantics (CGD) applied to SPMD algorithms makes application development and design space exploration simpler than message passing, while providing on-par performance. CGD applications are specified as dependencies between computation modules and data distributions. Communication and synchronization are added automatically and optimized for specific architectures, relieving programmers of this task. Many high level algorithm changes are easy to implement in CGD by redefining data distributions. These include exposing communication overlap by decreasing task grain, and aggregating communication by replicating data and computation. We briefly present a coordination language with dataflow semantics that implements the CGD model. Our implementation currently supports MPI, SHMEM, and pthreads. Results on the Altix 4700 show that our optimized CGD FT is 27% faster than the original NPB 2.3 MPI implementation, and the optimized CGD stencil has a 41% advantage over handwritten MPI.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122775042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication","authors":"M. Nüssle, Martin Scherer, U. Brüning","doi":"10.1109/ICPP.2009.62","DOIUrl":"https://doi.org/10.1109/ICPP.2009.62","url":null,"abstract":"This paper introduces a new highly optimized architecture for remote memory access (RMA). RMA, using put and get operations, is a one-sided communication function which, among others, is important in current and upcoming Partitioned Global Address Space (PGAS) systems. In this work, a virtualized hardware unit is described which is resource optimized and exhibits high overlap, processor offload, and very good latency characteristics. To start an RMA operation, a single HyperTransport packet caused by one CPU instruction is sufficient, thus reducing latency to an absolute minimum. In addition to the basic architecture, an implementation in FPGA technology is presented together with an evaluation of the target ASIC implementation. The current system can sustain more than 4.9 million transactions per second on the FPGA and exhibits an end-to-end latency of 1.2 μs for an 8-byte put operation. Both values are limited by the FPGA technology used for the prototype implementation. An estimation of the performance reachable with ASIC technology suggests that application-to-application latencies of less than 500 ns are feasible.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116540902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Parallel Execution of an Event-Based Radio Signal Propagation Model for Cluttered 3D Terrains","authors":"S. Seal, K. Perumalla","doi":"10.1109/ICPP.2009.42","DOIUrl":"https://doi.org/10.1109/ICPP.2009.42","url":null,"abstract":"Estimation of radio signal strength is essential in many applications, including the design of military radio communications and industrial wireless installations. While classical approaches such as finite difference methods are well-known, new event-based models of radio signal propagation have been recently shown to deliver such estimates faster (via serial execution) when compared to other methods. For scenarios with large or richly-featured geographical volumes, however, parallel processing is required to meet the memory and computation time demands. Here, we present a scalable and efficient parallel execution of a recently-developed event-based radio signal propagation model. We demonstrate its scalability to thousands of processors, with parallel speedups over 1000x. The speed and scale achieved by our parallel execution allow for larger scenarios and faster execution than has ever been reported before.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115105746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LeWI: A Runtime Balancing Algorithm for Nested Parallelism","authors":"Marta Garcia, J. Corbalán, J. Labarta","doi":"10.1109/ICPP.2009.56","DOIUrl":"https://doi.org/10.1109/ICPP.2009.56","url":null,"abstract":"We present LeWI, a novel load balancing algorithm that can balance applications with very different patterns of imbalance. Our algorithm can balance fine-grain imbalances, non-iterative applications, and applications with irregular imbalance. To achieve this, LeWI reassigns the computational resources of blocked processes to more heavily loaded processes. We have implemented LeWI within DLB, a Dynamic Load Balancing library we developed. DLB helps parallel programming models make the most of the computational power available with minimum effort. It solves the imbalance among processes in applications with two levels of parallelism using the malleability of the inner level. The performance evaluation shows that LeWI, together with DLB, improves the performance of a range of unbalanced applications, and that it does not introduce significant overhead when applied to well-balanced applications. We therefore present a mechanism that can be used with any hybrid application without requiring a programmer to analyze or modify the application.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"651 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123347843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Group Operation Assembly Language - A Flexible Way to Express Collective Communication","authors":"T. Hoefler, Christian Siebert, A. Lumsdaine","doi":"10.1109/ICPP.2009.70","DOIUrl":"https://doi.org/10.1109/ICPP.2009.70","url":null,"abstract":"The implementation and optimization of collective communication operations is an important field of active research. Such operations directly influence application performance and need to map the communication requirements in an optimal way to steadily changing network architectures. In this work, we define an abstract domain-specific language to express arbitrary group communication operations. We show the universality of this language and how all existing collective operations can be implemented with it. By design, it readily lends itself to blocking and nonblocking execution, as well as to off-loaded execution of complex group communication operations. We also define several offline and online optimizations (compiler transformations and scheduling decisions, respectively) to improve the overall performance of the operation. Performance results show that the overhead to express current collective operations is negligible in comparison to the potential gains in a highly optimized implementation.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"360 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121721591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}