2009 International Conference on Parallel Processing: Latest Papers

Constructing Gene Regulatory Networks on Clusters of Cell Processors
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.35
J. Zola, Abhinav Sarje, S. Aluru
{"title":"Constructing Gene Regulatory Networks on Clusters of Cell Processors","authors":"J. Zola, Abhinav Sarje, S. Aluru","doi":"10.1109/ICPP.2009.35","DOIUrl":"https://doi.org/10.1109/ICPP.2009.35","url":null,"abstract":"Constructing genome-wide gene regulatory networks from a large number of gene expression profile measurements is an important problem in systems biology. While several techniques have been developed, none of them is parallel, and they lack the capability to scale to the whole-genome level or incorporate the largest data sets, particularly with rigorous statistical testing. To address this problem, we recently developed a mutual information theory based parallel method for gene network reconstruction. In this paper, we extend this work to a cluster of Cell processors. We use parallelization across multiple Cells, multiple cores within each Cell, and vector units within the cores to develop a high performance implementation that effectively addresses the scaling problem. We present experimental results comparing the Cell implementation with a standard uniprocessor implementation and an implementation on a conventional supercomputer. Finally, we report the construction of a large 15,203 gene network of the plant Arabidopsis thaliana from 2,996 microarray experiments on a 8-node Cell blade cluster in 2 hours and 24 minutes.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133213092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
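The abstract above describes a mutual-information-based method for gene network reconstruction. As a rough illustration of the quantity involved, the sketch below computes a simple histogram (plug-in) estimate of the mutual information between two expression profiles. The abstract does not say which estimator the authors use, so the function name `mutual_information`, the equal-width binning, and the default `bins=4` are all assumptions made for this example, not the paper's method.

```python
import math
from collections import Counter


def mutual_information(x, y, bins=4):
    """Histogram (plug-in) estimate of mutual information in bits.

    x and y are equally long sequences of expression values, one value per
    experiment. This is an illustrative estimator, not the paper's.
    """
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")

    def discretize(values):
        lo, hi = min(values), max(values)
        # A constant profile has zero range; fall back to width 1 so that
        # every value lands in bin 0.
        width = (hi - lo) / bins or 1.0
        # The maximum value would map to index `bins`, so clamp it.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))

    # Sum p(i, j) * log2( p(i, j) / (p(i) * p(j)) ) over observed bin pairs.
    mi = 0.0
    for (i, j), count in pxy.items():
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi
```

In a network reconstruction setting, a score like this would be computed for every gene pair and an edge kept only when the score passes a significance test; that pairwise, all-against-all structure is what makes the problem a natural fit for parallel hardware.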
Integrated Performance Views in Charm++: Projections Meets TAU
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.49
Scott Biersdorff, Chee Wai Lee, A. Malony, L. Kalé
{"title":"Integrated Performance Views in Charm++: Projections Meets TAU","authors":"Scott Biersdorff, Chee Wai Lee, A. Malony, L. Kalé","doi":"10.1109/ICPP.2009.49","DOIUrl":"https://doi.org/10.1109/ICPP.2009.49","url":null,"abstract":"The Charm++ parallel programming system provides a modular performance interface that can be used to extend its performance measurement and analysis capabilities. The interface exposes execution events of interest representing Charm++ scheduling operations, application methods/routines, and communication events for observation by alternative performance modules configured to implement different measurement features. The paper describes the Charm++'s performance interface and how the Charm++ Projections tool and the TAU Performance System can provide integrated trace-based and profile-based performance views. These two tools are complementary, providing the user with different performance perspectives on Charm++ applications based on performance data detail and temporal and spatial analysis. How the tools work in practice is demonstrated in a parallel performance analysis of NAMD, a scalable molecular dynamics code that applies many of Charm++'s unique features.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128844679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
On the Scalability of Parallel Verilog Simulation
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.9
S. Meraji, Wei Zhang, C. Tropper
{"title":"On the Scalability of Parallel Verilog Simulation","authors":"S. Meraji, Wei Zhang, C. Tropper","doi":"10.1109/ICPP.2009.9","DOIUrl":"https://doi.org/10.1109/ICPP.2009.9","url":null,"abstract":"As a consequence of Moore’s law, the size of integrated circuits has grown extensively, resulting in simulation becoming the major bottleneck in the circuit design process. Consequently, parallel simulation has emerged as an approach which can be both fast and cost effective. In this paper, we examine the performance of a parallel Verilog simulator on four large, real designs. As previous work has made use of either relatively small benchmarks or synthetic circuits, the use of these circuits is far more realistic. We develop a parser for Verilog files enabling us to simulate in parallel all synthesizable Verilog circuits. We utilize four circuits as our test benches; the LEON Processor with 200k gates, the OpenSparc T2 processor with 400k gates and two Viterbi decoder circuits with 100k and 800k gates respectively. The simulator makes use of XTW and to our knowledge is the first Verilog simulator which can parse all synthesizable Verilog files. We observed 4,000,000 events per second on 32 processors for the Viterbi decoder with 800k gates. The simulators’ performance was shown to be scalable.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115499562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.22
Darius Buntinas, Brice Goglin, David Goodell, Guillaume Mercier, Stéphanie Moreaud
{"title":"Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis","authors":"Darius Buntinas, Brice Goglin, David Goodell, Guillaume Mercier, Stéphanie Moreaud","doi":"10.1109/ICPP.2009.22","DOIUrl":"https://doi.org/10.1109/ICPP.2009.22","url":null,"abstract":"The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128169567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 72
A Distributed Three-hop Routing Protocol to Increase the Capacity of Hybrid Networks
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.36
Ze Li, Haiying Shen
{"title":"A Distributed Three-hop Routing Protocol to Increase the Capacity of Hybrid Networks","authors":"Ze Li, Haiying Shen","doi":"10.1109/ICPP.2009.36","DOIUrl":"https://doi.org/10.1109/ICPP.2009.36","url":null,"abstract":"Hybrid wireless networks combining the advantages of both ad-hoc networks and infrastructure wireless networks have been receiving increasingly attentions because of their ultra-high performance. An efficient data routing protocol is an important component in such networks for high capacity and scalability. However, most routing protocols for the networks simply combine an ad-hoc transmission mode and a cellular transmission mode, which fail to take advantage of the dual-feature architecture. This paper presents a distributed Three-hop Routing (DTR) protocol for hybrid wireless networks. DTR divides a message data stream into segments and transmits the segments in a distributed manner. It makes full spatial reuse of system via high speed ad-hoc interface and alleviate mobile gateway congestion via cellular interface. Furthermore, sending segments to a number of base stations simultaneously increases the throughput, and makes full use of wide-spread base stations. In addition, DTR significantly reduces overhead due to short path length and eliminates route discovery and maintenance overhead. Theoretical analysis and simulation results show the superiority of DTR in comparison with other routing protocols in terms of throughput capacity, scalability and mobility resilience.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127435513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Optimizing Communication Scheduling Using Dataflow Semantics
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.66
Adrian Soviani, J. Singh
{"title":"Optimizing Communication Scheduling Using Dataflow Semantics","authors":"Adrian Soviani, J. Singh","doi":"10.1109/ICPP.2009.66","DOIUrl":"https://doi.org/10.1109/ICPP.2009.66","url":null,"abstract":"We show how coarse grain dataflow semantics (CGD) applied to SPMD algorithms makes application development and design space exploration simpler compared to message passing, at the same time providing on par performance. CGD applications are specified as dependencies between computation modules and data distributions. Communication and synchronization are added automatically and optimized for specific architectures, relieving programmers of this task. Many high level algorithm changes are easy to implement in CGD by redefining data distributions. These include exposing communication overlap by decreasing task grain, and aggregating communication by replicating data and computation. We briefly present a coordination language with dataflow semantics that implements the CGD model. Our implementation currently supports MPI, SHMEM, and pthreads. Results on Altix 4700 show our optimized CGD FT is 27% faster than original NPB 2.3 MPI implementation, and optimized CGD stencil has a 41% advantage over handwritten MPI.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122775042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.62
M. Nüssle, Martin Scherer, U. Brüning
{"title":"A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication","authors":"M. Nüssle, Martin Scherer, U. Brüning","doi":"10.1109/ICPP.2009.62","DOIUrl":"https://doi.org/10.1109/ICPP.2009.62","url":null,"abstract":"This paper introduces a new highly optimized architecture for remote memory access (RMA). RMA, using put and get operations, is a one-sided communication function which amongst others is important in current and upcoming Partitioned Global Address Space (PGAS) systems. In this work, a virtualized hardware unit is described which is resource optimized, exhibits high overlap, processor offload and very good latency characteristics. To start an RMA operation a single HyperTransport packet caused by one CPU instruction is sufficient, thus reducing latency to an absolute minimum. In addition to the basic architecture an implementation in FPGA technology is presented together with an evaluation of the target ASIC-implementation. The current system can sustain more than 4.9 million transactions per second on the FPGA and exhibits an end-to-end latency of 1.2 μs for an 8-byte put operation. Both values are limited by the FPGA technology used for the prototype implementation. An estimation of the performance reachable on ASIC technology suggests that application to application latencies of less than 500 ns are feasible.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116540902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
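The two FPGA figures quoted in the abstract above (more than 4.9 million transactions per second, and 1.2 μs end-to-end latency for an 8-byte put) can be related with a little arithmetic. The sketch below is only a back-of-the-envelope reading: it assumes both numbers describe the same steady-state workload, which the abstract does not state, and it applies Little's law under that assumption. The constant and function names are invented for this example.

```python
# Figures quoted in the abstract (FPGA prototype). Treating them as if they
# were measured under the same workload is an assumption of this sketch.
TRANSACTIONS_PER_SECOND = 4.9e6  # sustained transaction rate (lower bound)
PUT_LATENCY_SECONDS = 1.2e-6     # end-to-end latency of an 8-byte put


def mean_interval_ns(rate_per_second):
    """Average time between consecutive transactions, in nanoseconds."""
    return 1e9 / rate_per_second


def transactions_in_flight(rate_per_second, latency_seconds):
    """Little's law, L = rate * latency, valid for a steady-state system."""
    return rate_per_second * latency_seconds


interval_ns = mean_interval_ns(TRANSACTIONS_PER_SECOND)
in_flight = transactions_in_flight(TRANSACTIONS_PER_SECOND, PUT_LATENCY_SECONDS)
print(f"{interval_ns:.0f} ns between transactions, about {in_flight:.1f} in flight")
```

Under these assumptions a new transaction starts roughly every 200 ns, well under the 1.2 μs latency of a single put, which would imply several operations overlapping in the pipeline at once. That is consistent with the "high overlap" the abstract claims, but it is an inference from the quoted numbers, not a result reported by the authors.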
Scalable Parallel Execution of an Event-Based Radio Signal Propagation Model for Cluttered 3D Terrains
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.42
S. Seal, K. Perumalla
{"title":"Scalable Parallel Execution of an Event-Based Radio Signal Propagation Model for Cluttered 3D Terrains","authors":"S. Seal, K. Perumalla","doi":"10.1109/ICPP.2009.42","DOIUrl":"https://doi.org/10.1109/ICPP.2009.42","url":null,"abstract":"Estimation of radio signal strength is essential in many applications, including the design of military radio communications and industrial wireless installations. While classical approaches such as finite difference methods are well-known, new event-based models of radio signal propagation have been recently shown to deliver such estimates faster (via serial execution) when compared to other methods. For scenarios with large or richly-featured geographical volumes, however, parallel processing is required to meet the memory and computation time demands. Here, we present a scalable and efficient parallel execution of a recently-developed event-based radio signal propagation model. We demonstrate its scalability to thousands of processors, with parallel speedups over 1000x. The speed and scale achieved by our parallel execution allow for larger scenarios and faster execution than has ever been reported before.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115105746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
LeWI: A Runtime Balancing Algorithm for Nested Parallelism
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.56
Marta Garcia, J. Corbalán, J. Labarta
{"title":"LeWI: A Runtime Balancing Algorithm for Nested Parallelism","authors":"Marta Garcia, J. Corbalán, J. Labarta","doi":"10.1109/ICPP.2009.56","DOIUrl":"https://doi.org/10.1109/ICPP.2009.56","url":null,"abstract":"We present LeWI: a novel load balancing algorithm, that can balance applications with very different patterns of imbalance. Our algorithm can balance fine grain imbalances, non iterative applications and applications with irregular imbalance. To achieve this LeWI reassigns the computational resources of blocked processes to other processes more loaded. We have implemented LeWI within DLB a Dynamic Load Balancing Library developed by us. DLB helps parallel programming models to make the most of the computational power available with the minimum effort. It solves the imbalance among processes in applications with two levels of parallelism using the malleability of the inner level. The performance evaluation shows that LeWI, the novel balancing algorithm we are presenting in this paper, together with DLB is able to improve the performance of a different range of unbalanced applications and when applied to well balanced applications it does not introduce significant overhead. Therefore we present a mechanism that can be used with any hybrid application without needing a programmer to analyze the application nor modify it.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"651 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123347843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Group Operation Assembly Language - A Flexible Way to Express Collective Communication
2009 International Conference on Parallel Processing Pub Date : 2009-09-22 DOI: 10.1109/ICPP.2009.70
T. Hoefler, Christian Siebert, A. Lumsdaine
{"title":"Group Operation Assembly Language - A Flexible Way to Express Collective Communication","authors":"T. Hoefler, Christian Siebert, A. Lumsdaine","doi":"10.1109/ICPP.2009.70","DOIUrl":"https://doi.org/10.1109/ICPP.2009.70","url":null,"abstract":"The implementation and optimization of collective communication operations is an important field of active research. Such operations directly influence application performance and need to map the communication requirements in an optimal way to steadily changing network architectures. In this work, we define an abstract domain-specific language to express arbitrary group communication operations. We show the universality of this language and how all existing collective operations can be implemented with it. By design, it readily lends itself to blocking and nonblocking execution, as well as to off-loaded execution of complex group communication operations. We also define several offline and online optimizations (compiler transformations and scheduling decisions, respectively) to improve the overall performance of the operation. Performance results show that the overhead to express current collective operations is negligible in comparison to the potential gains in a highly optimized implementation.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"360 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121721591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32