2014 IEEE 28th International Parallel and Distributed Processing Symposium最新文献_第3页

New Effective Multithreaded Matching Algorithms 新的高效多线程匹配算法

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.61

F. Manne, M. Halappanavar

引用次数: 45

How Well Do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis 图形处理平台的性能如何?实证绩效评价与分析

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.49

Yong Guo, M. Biczak, A. Varbanescu, A. Iosup, Claudio Martella, Theodore L. Willke

引用次数: 118

s-Step Krylov Subspace Methods as Bottom Solvers for Geometric Multigrid 几何多重网格的s步Krylov子空间底解方法

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.119

Samuel Williams, M. Lijewski, A. Almgren, B. V. Straalen, E. Carson, Nicholas Knight, J. Demmel

{"title":"s-Step Krylov Subspace Methods as Bottom Solvers for Geometric Multigrid","authors":"Samuel Williams, M. Lijewski, A. Almgren, B. V. Straalen, E. Carson, Nicholas Knight, J. Demmel","doi":"10.1109/IPDPS.2014.119","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.119","url":null,"abstract":"Geometric multigrid solvers within adaptive mesh refinement (AMR) applications often reach a point where further coarsening of the grid becomes impractical as individual sub domain sizes approach unity. At this point the most common solution is to use a bottom solver, such as BiCGStab, to reduce the residual by a fixed factor at the coarsest level. Each iteration of BiCGStab requires multiple global reductions (MPI collectives). As the number of BiCGStab iterations required for convergence grows with problem size, and the time for each collective operation increases with machine scale, bottom solves in large-scale applications can constitute a significant fraction of the overall multigrid solve time. In this paper, we implement, evaluate, and optimize a communication-avoiding s-step formulation of BiCGStab (CABiCGStab for short) as a high-performance, distributed-memory bottom solver for geometric multigrid solvers. This is the first time s-step Krylov subspace methods have been leveraged to improve multigrid bottom solver performance. We use a synthetic benchmark for detailed analysis and integrate the best implementation into BoxLib in order to evaluate the benefit of a s-step Krylov subspace method on the multigrid solves found in the applications LMC and Nyx on up to 32,768 cores on the Cray XE6 at NERSC. Overall, we see bottom solver improvements of up to 4.2x on synthetic problems and up to 2.7x in real applications. This results in as much as a 1.5x improvement in solver performance in real applications.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114953312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 29

Parallel Mutual Information Based Construction of Whole-Genome Networks on the Intel (R) Xeon Phi (TM) Coprocessor 基于Intel (R) Xeon Phi (TM)协处理器的并行互信息全基因组网络构建

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.35

Sanchit Misra, K. Pamnany, S. Aluru

{"title":"Parallel Mutual Information Based Construction of Whole-Genome Networks on the Intel (R) Xeon Phi (TM) Coprocessor","authors":"Sanchit Misra, K. Pamnany, S. Aluru","doi":"10.1109/IPDPS.2014.35","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.35","url":null,"abstract":"Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel (R) Xeon Phi (TM) coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel (R) Xeon (R) processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"260 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123469740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 13

Computational Co-design of a Multiscale Plasma Application: A Process and Initial Results 多尺度等离子体应用的计算协同设计:过程和初步结果

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.114

J. Payne, D. Knoll, A. McPherson, W. Taitano, L. Chacón, Guangye Chen, S. Pakin

引用次数: 1

LFTI: A New Performance Metric for Assessing Interconnect Designs for Extreme-Scale HPC Systems LFTI:一种评估超大规模HPC系统互连设计的新性能指标

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.38

Xin Yuan, S. Mahapatra, M. Lang, S. Pakin

{"title":"LFTI: A New Performance Metric for Assessing Interconnect Designs for Extreme-Scale HPC Systems","authors":"Xin Yuan, S. Mahapatra, M. Lang, S. Pakin","doi":"10.1109/IPDPS.2014.38","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.38","url":null,"abstract":"Traditionally, interconnect performance is either characterized by simple topological parameters such as bisection bandwidth or studied through simulation that gives detailed performance information for the scenarios simulated. Neither of these approaches provides a good performance overview for extreme-scale interconnects. The topological parameters are not directly related to application level communication performance while the simulation complexity limits the number of scenarios that can be investigated. In this work, we propose a new performance metric, called LANL-FSU Throughput Indices (LFTI), for characterizing the throughput performance of interconnect designs. LFTI combines the simplicity of topological parameters and the accuracy of simulation: like topological parameters, LFTI can be derived from interconnect specification, at the same time, it directly reflects the application level communication performance. Moreover, in cases when the theoretical throughput for each communication pattern can be modeled efficiently for an interconnect, LFTI for the interconnect can be computed efficiently. These features potentially allow LFTI to be used for rapid and comprehensive evaluation and comparison of extreme-scale interconnect designs. We demonstrate the effectiveness of LFTI by using it to evaluate and explore the design space of a number of large-scale interconnect designs.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125884726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

RCMP: Enabling Efficient Recomputation Based Failure Resilience for Big Data Analytics RCMP:为大数据分析实现基于故障恢复的高效重计算

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.102

Florin Dinu, T. Ng

{"title":"RCMP: Enabling Efficient Recomputation Based Failure Resilience for Big Data Analytics","authors":"Florin Dinu, T. Ng","doi":"10.1109/IPDPS.2014.102","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.102","url":null,"abstract":"Data replication, the main failure resilience strategy used for big data analytics jobs, can be unnecessarily inefficient. It can cause serious performance degradation when applied to intermediate job outputs in multi-job computations. For instance, for I/O-intensive big data jobs, data replication is especially expensive because very large datasets need to be replicated. Reducing the number of replicas is not a satisfactory solution as it only aggravates a fundamental limitation of data replication: its failure resilience guarantees are limited by the number of available replicas. When all replicas of some piece of intermediate job output are lost, cascading job recomputations may be required for recovery. In this paper we show how job recomputation can be made a first-order failure resilience strategy for big data analytics. The need for data replication can thus be significantly reduced. We present RCMP, a system that performs efficient job recomputation. RCMP can persist task outputs across jobs and leverage them to minimize the work performed during job recomputations. More importantly, RCMP addresses two important challenges that appear during job recomputations. The first is efficiently utilizing the available compute node parallelism. The second is dealing with hot-spots. RCMP handles both by switching to a finer-grained task scheduling granularity for recomputations. Our experiments show that RCMP's benefits hold across two different clusters, for job inputs as small as 40GB or as large as 1.2TB. Compared to RCMP, data replication is 30%-100% worse during failure-free periods. More importantly, by efficiently performing recomputations, RCMP is comparable or better even under single and double data loss events.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129463872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

ReDHiP: Recalibrating Deep Hierarchy Prediction for Energy Efficiency ReDHiP:重新校准能源效率的深度层次预测

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.98

Xun Li, D. Franklin, R. Bianchini, F. Chong

引用次数: 1

Remote Invalidation: Optimizing the Critical Path of Memory Transactions 远程失效:优化内存事务的关键路径

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.30

Ahmed Hassan, R. Palmieri, B. Ravindran

{"title":"Remote Invalidation: Optimizing the Critical Path of Memory Transactions","authors":"Ahmed Hassan, R. Palmieri, B. Ravindran","doi":"10.1109/IPDPS.2014.30","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.30","url":null,"abstract":"Software Transactional Memory (STM) systems are increasingly emerging as a promising alternative to traditional locking algorithms for implementing generic concurrent applications. To achieve generality, STM systems incur overheads to the normal sequential execution path, including those due to spin locking, validation (or invalidation), and commit/abort routines. We propose a new STM algorithm called Remote Invalidation (or RInval) that reduces these overheads and improves STM performance. RInval's main idea is to execute commit and invalidation routines on remote server threads that run on dedicated cores, and use cache-aligned communication between application's transactional threads and the server routines. By remote execution of commit and invalidation routines and cache-aligned communication, RInval reduces the overhead of spin locking and cache misses on shared locks. By running commit and invalidation on separate cores, they become independent of each other, increasing commit concurrency. We implemented RInval in the Rochester STM framework. Our experimental studies on micro-benchmarks and the STAMP benchmark reveal that RInval outperforms InvalSTM, the corresponding non-remote invalidation algorithm, by as much as an order of magnitude. Additionally, RInval obtains competitive performance to validation-based STM algorithms such as NOrec, yielding up to 2x performance improvement.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129002098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Analytically Modeling Application Execution for Software-Hardware Co-design 面向软硬件协同设计的应用执行分析建模

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.56

Jichi Guo, Jiayuan Meng, Qing Yi, V. Morozov, Kalyan Kumaran

{"title":"Analytically Modeling Application Execution for Software-Hardware Co-design","authors":"Jichi Guo, Jiayuan Meng, Qing Yi, V. Morozov, Kalyan Kumaran","doi":"10.1109/IPDPS.2014.56","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.56","url":null,"abstract":"Software-hardware co-design has become increasingly important as the scale and complexity of both are reaching an unprecedented level. To predict and understand application behavior on emerging or conceptual systems, existing research has mostly relied on cycle-accurate micro-architecture simulators, which are known to be time-consuming and are oblivious to workloads' control flow structure. As a result, simulations are often limited to small kernels, and the first step in the co-design process is often to extract important kernels, construct mini-applications, and identify potential hardware limitations. This requires a high level understanding about the full applications' potential behavior on a future system, e.g. the most time-consuming regions, the performance bottlenecks for these regions, etc. Unfortunately, such application knowledge gained from one system may not hold true on a future system. One solution is to instrument the full application with timers and simulate it with a reasonable input size, which can be a daunting task in itself. We propose an alternative approach to gain first-order insights into hardware-dependent application behavior by trading off the accuracy of analysis for improved efficiency. By modeling the execution flows of user applications and analyzing it using target hardware's performance models, our technique requires no cycle-accurate simulation on a prospective system. In fact, our technique's analysis time does not increase with the input data size.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115155677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6