2011 IEEE International Parallel & Distributed Processing Symposium: Latest Papers, Page 9

Co-analysis of RAS Log and Job Log on Blue Gene/P

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.83

Ziming Zheng, Li Yu, Wei Tang, Z. Lan, Rinku Gupta, N. Desai, S. Coghlan, Daniel Buettner

Citations: 82

Accelerating Protein Sequence Search in a Heterogeneous Computing System

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.115

S. Xiao, Heshan Lin, Wu-chun Feng

Citations: 37

RDMA Capable iWARP over Datagrams

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.66

Ryan E. Grant, Mohammad J. Rashti, A. Afsahi, P. Balaji

Abstract: iWARP is a state-of-the-art, high-speed, connection-based RDMA networking technology that provides InfiniBand-like zero-copy and one-sided communication capabilities over Ethernet. Despite the benefits offered by iWARP, many data center and web-based applications that rely on datagram-based semantics (mostly through UDP/IP), such as stock-market trading and media-streaming applications, cannot take advantage of it because the iWARP standard is defined only over reliable, connection-oriented transports. This paper presents an RDMA model that functions over reliable and unreliable datagrams. The ability to use datagrams significantly expands the application space serviced by iWARP and can bring the scalability advantages of a connectionless transport to iWARP. In our previous work, we developed an iWARP datagram solution using send/receive semantics, showing excellent memory scalability and performance benefits over the current TCP-based iWARP. In this paper, we demonstrate an improved iWARP design that provides true RDMA semantics over datagrams. Specifically, because traditional RDMA semantics do not map well to unreliable communication, we propose RDMA Write-Record, the first and only method capable of supporting RDMA Write over both unreliable and reliable datagrams. We demonstrate through a proof-of-concept software implementation that datagram-iWARP is feasible for real-world applications. Our proposed RDMA Write-Record method has been designed with data loss in mind and can provide superior performance under conditions of packet loss. Micro-benchmarks show that RDMA-capable datagram-iWARP achieves up to a 256% increase in large-message bandwidth and up to a 24.4% improvement in small-message latency over traditional iWARP. For application results we focus on streaming applications, showing a 24% improvement in memory usage and up to a 74% improvement in performance, although the proposed approach is also applicable to the HPC domain.

Citations: 22

Willow: A Control System for Energy and Thermal Adaptive Computing

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.14

K. Kant, M. Murugan, D. Du

Citations: 29

Multifrontal Factorization of Sparse SPD Matrices on GPUs

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.44

Thomas George, Vaibhav Saxena, Anshul Gupta, Amik Singh, Anamitra R. Choudhury

Abstract: Solving large sparse linear systems is often the most computationally intensive component of scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers, resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently, due to the complexity involved in an efficient implementation and concerns about low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best-policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable to the current or an extended set of policies for different CPU-GPU combinations, as well as to different combinations of dense kernels for both the CPU and the GPU.

Citations: 42

Using Shared Memory to Accelerate MapReduce on Graphics Processing Units

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.80

Feng Ji, Xiaosong Ma

Abstract: Modern General Purpose Graphics Processing Units (GPGPUs) provide high degrees of parallelism in computation and memory access, making them suitable for data-parallel applications such as those using the elastic MapReduce model. Yet designing a MapReduce framework for GPUs faces significant challenges brought by their multi-level memory hierarchy. Due to the absence of atomic operations in earlier generations of GPUs, existing GPU MapReduce frameworks have problems handling input/output data with varied or unpredictable sizes. Also, existing frameworks mostly utilize a single level of memory, i.e., the relatively spacious yet slow global memory. In this work, we explore the potential benefit of enabling a GPU MapReduce framework to use multiple levels of the GPU memory hierarchy. We propose a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy. Centering on the efficient utilization of the fast but very small shared memory, we designed and implemented a GPU MapReduce framework whose key techniques include (1) shared memory staging area management, (2) thread-role partitioning, and (3) intra-block thread synchronization. We evaluated the framework with five popular MapReduce workloads and studied their performance under different GPU memory usage choices. Our results reveal that exploiting GPU shared memory is highly promising for the Map phase (with an average 2.85x speedup over using global memory only), while in the Reduce phase the benefit of using shared memory is much less pronounced, due to the high input-to-output ratio. In addition, when compared to Mars, an existing GPU MapReduce framework, our system is shown to bring a significant speedup in the Map and Reduce phases.

Citations: 50

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.16

J. White, J. Dongarra

Citations: 24

X10 as a Parallel Language for Scientific Computation: Practice and Experience

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.103

Josh Milthorpe, V. Ganesh, Alistair P. Rendell, D. Grove

Abstract: X10 is an emerging Partitioned Global Address Space (PGAS) language intended to significantly increase the productivity of developing scalable HPC applications. The language has now matured to a point where it is meaningful to consider writing large-scale scientific application codes in X10. This paper reports our experiences writing three codes from the chemistry/materials science domain entirely in X10: Fast Multipole Method (FMM), Particle Mesh Ewald (PME), and Hartree-Fock (HF). Performance results are presented for up to 256 places on a Blue Gene/P system. During the course of this work our experiences have been shared with the X10 development team, so that application requirements could inform language design discussions as the language capabilities influenced algorithm design. This resulted in improvements in the language implementation and standard class libraries, including the design of the array API and support for complex math. Data constructs in X10 such as places and distributed arrays, and parallel constructs such as finish and async, simplify implementation of the applications in comparison with MPI. However, current implementation limitations in X10 2.1.2 make it difficult to achieve scalable performance using the most natural expressions of the algorithms. The most serious limitation is the use of point-to-point communication patterns, rather than collectives, to implement parallel constructs and array operations. This issue will be addressed in future releases of X10.

Citations: 36

Completely Distributed Particle Filters for Target Tracking in Sensor Networks

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.40

Bo Jiang, B. Ravindran

Abstract: Particle filters (PFs) are widely used for the tracking problem in dynamic systems. Despite their remarkable tracking performance and flexibility, PFs require intensive computation and communication, which are strictly constrained in wireless sensor networks (WSNs). Thus, distributed particle filters (DPFs) have been studied to distribute the computational workload onto multiple nodes while minimizing the communication among them. However, weight normalization and resampling in generic PFs cause significant challenges in the distributed implementation. Few existing efforts on DPFs could be implemented in a completely distributed manner. In this paper, we design a completely distributed particle filter (CDPF) for target tracking in sensor networks, and further improve it with neighborhood estimation toward minimizing the communication cost. First, we describe the particle maintenance and propagation mechanism, by which particles are maintained on different sensor nodes and propagated along the target trajectory. Then, we design the CDPF algorithm by adjusting the order of the PF's four steps and leveraging data aggregation during particle propagation. Finally, we develop a neighborhood estimation method to replace the measurement broadcasting and the calculation of likelihood functions. With this approximate estimation, the communication cost of DPFs can be minimized. Our experimental evaluations show that although CDPF incurs about 50% more estimation error than the semi-distributed particle filter (SDPF), its communication cost is lower than that of SDPF by as much as 90%.
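To make the weight normalization and resampling steps named in the abstract above concrete, here is a minimal sketch of one cycle of a generic, centralized bootstrap particle filter for a 1-D target. It is not the paper's CDPF: the Gaussian motion and measurement models, the parameter values, and the names `particle_filter_step` and `track` are all assumptions chosen for illustration, and the paper's own four steps may be labeled differently.

```python
import math
import random


def particle_filter_step(particles, measurement, rng, motion_std=0.5, meas_std=1.0):
    """One cycle of a generic bootstrap particle filter (illustrative only)."""
    # 1. Predict: move every particle with an assumed random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]

    # 2. Weight: score each particle with an assumed Gaussian likelihood.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in predicted]

    # 3. Normalize: needs the sum over ALL particles, which is the step that
    #    becomes awkward once particles live on different sensor nodes.
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]

    # The weighted mean of the predicted particles is the position estimate.
    estimate = sum(w * p for w, p in zip(weights, predicted))

    # 4. Resample: draw a new particle set in proportion to the weights,
    #    which again needs a global view of the weight distribution.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def track(measurements, num_particles=500, seed=0):
    """Run the filter over a measurement sequence and return the estimates."""
    rng = random.Random(seed)
    particles = [rng.uniform(-10.0, 10.0) for _ in range(num_particles)]
    estimates = []
    for z in measurements:
        particles, estimate = particle_filter_step(particles, z, rng)
        estimates.append(estimate)
    return estimates
```

Steps 3 and 4 both depend on quantities aggregated over every particle, which is consistent with the abstract's remark that normalization and resampling are the hard part of a distributed implementation.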

Citations: 23

DryadOpt: Branch-and-Bound on Distributed Data-Parallel Execution Engines

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.121

M. Budiu, D. Delling, Renato F. Werneck

Abstract: We introduce DryadOpt, a library that enables massively parallel and distributed execution of optimization algorithms for solving hard problems. DryadOpt performs an exhaustive search of the solution space using branch-and-bound, recursively splitting the original problem into many simpler subproblems. It uses both parallelism (at the core level) and distributed execution (at the machine level). DryadOpt provides a simple yet powerful interface to its users, who only need to implement sequential code to process individual subproblems (either by solving them in full or by generating new subproblems). The parallelism and distribution are handled automatically by DryadOpt and are invisible to the user. The distinctive feature of our system is that it is implemented on top of DryadLINQ, a distributed data-parallel execution engine similar to Hadoop and MapReduce. Although these engines offer a constrained application model with restricted communication patterns, our experiments show that careful design choices allow DryadOpt to scale linearly with the number of machines, with very little overhead.
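The division of labor described in the abstract above, where the user writes only sequential code that either solves a subproblem or splits it, can be sketched on a single machine as follows. This is not DryadOpt's interface: the 0/1 knapsack problem, the `Subproblem` class, and the functions `upper_bound`, `process`, and `branch_and_bound` are hypothetical names chosen for illustration, and the driver loop here is purely sequential, with none of the parallel or distributed execution the library provides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Item = Tuple[int, int]  # (value, weight), weight assumed positive


@dataclass(frozen=True)
class Subproblem:
    index: int  # next item to decide on
    value: int  # value collected so far
    room: int   # remaining knapsack capacity


def upper_bound(sub: Subproblem, items: List[Item]) -> float:
    """Optimistic bound: fill the remaining room greedily, allowing one
    fractional item. Requires items sorted by value/weight, descending."""
    bound, room = float(sub.value), sub.room
    for value, weight in items[sub.index:]:
        if weight <= room:
            bound += value
            room -= weight
        else:
            bound += value * room / weight
            break
    return bound


def process(sub: Subproblem, items: List[Item],
            best: int) -> Tuple[Optional[int], List[Subproblem]]:
    """User-style sequential step: solve a leaf, prune, or branch."""
    if sub.index == len(items):
        return sub.value, []          # leaf: fully solved
    if upper_bound(sub, items) <= best:
        return None, []               # pruned: cannot beat the incumbent
    value, weight = items[sub.index]
    children = [Subproblem(sub.index + 1, sub.value, sub.room)]  # skip item
    if weight <= sub.room:            # take item, if it fits
        children.append(Subproblem(sub.index + 1, sub.value + value, sub.room - weight))
    return None, children


def branch_and_bound(items: List[Item], capacity: int) -> int:
    """Sequential driver; a distributed engine would farm out `process` calls."""
    items = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    best = 0
    stack = [Subproblem(0, 0, capacity)]
    while stack:
        solved, children = process(stack.pop(), items, best)
        if solved is not None:
            best = max(best, solved)
        stack.extend(children)
    return best
```

For example, `branch_and_bound([(60, 10), (100, 20), (120, 30)], 50)` returns 220, taking the second and third items. The only part a user would write under this split is `process` and its bound; the stack-based loop stands in for whatever scheduling the execution engine performs.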

Citations: 27