Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing: Latest Articles, Page 2

Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2018-11-05 DOI: 10.5772/INTECHOPEN.80997

Oh-Kyoung Kwon, H. Ryu

Abstract: Large-scale electronic structure simulations coupled to an empirical modeling approach are critical as they present a robust way to predict various quantum phenomena in realistically sized nanoscale structures that are hard to handle with density functional theory. For tight-binding (TB) simulations of electronic structures that normally involve multimillion-atom systems for a direct comparison to experimentally realizable nanoscale materials and devices, we show that graphics processing unit (GPU) devices help in saving computing costs in terms of time and energy consumption. With a short introduction of the major numerical method adopted for TB simulations of electronic structures, this work presents a detailed description of the strategies to drive performance enhancement with GPU devices against traditional clusters of multicore processors. While this work only uses TB electronic structure simulations for benchmark tests, it can also be utilized as a practical guideline to enhance performance of numerical operations that involve large-scale sparse matrices.

Citations: 0

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294.2907306

Dingwen Tao, S. Song, S. Krishnamoorthy, Panruo Wu, Xin Liang, E. Zhang, D. Kerbyson, Zizhong Chen

Abstract: Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from the UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering from various types of soft errors in general iterative methods.

Citations: 41
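
The New-Sum abstract refers to checksum-based encoding for matrix-vector multiplication. As a rough illustration of the general idea only, the sketch below uses the classic column-sum checksum: if c = 1^T A, then the entries of y = A x must sum to c · x, so a mismatch signals a corrupted result. This is not the New-Sum scheme itself (which decouples checksum updates from the computation); the function names, the tolerance, and the `fault` injection parameter are all assumptions made for this example.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x on plain Python lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def column_checksum(A):
    """Checksum row c = 1^T A (sum of each column), computed once per matrix."""
    return [sum(col) for col in zip(*A)]


def checked_matvec(A, x, c, tol=1e-9, fault=None):
    """Return (y, ok); ok is False when sum(y) disagrees with c . x.

    `fault` is a hypothetical (index, delta) pair used only to simulate a
    soft error in the result vector.
    """
    y = matvec(A, x)
    if fault is not None:
        i, delta = fault
        y[i] += delta  # simulated corruption of one output entry
    expected = sum(ci * xi for ci, xi in zip(c, x))
    # Relative tolerance absorbs ordinary floating-point rounding.
    ok = abs(sum(y) - expected) <= tol * max(1.0, abs(expected))
    return y, ok


A = [[4.0, 1.0], [1.0, 3.0]]
x = [1.0, 2.0]
c = column_checksum(A)

y_clean, ok_clean = checked_matvec(A, x, c)
y_bad, ok_bad = checked_matvec(A, x, c, fault=(0, 0.5))
print(ok_clean, ok_bad)
```

In an iterative solver, a failed check would typically trigger recomputation or a rollback to a checkpoint, which is the role the abstract assigns to the checkpoint/rollback scheme.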

Scaling Spark on HPC Systems

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294.2907310

Nicholas Chaimov, A. Malony, S. Canon, Costin Iancu, K. Ibrahim, Jayanth Srinivasan

Abstract: We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it makes single-node performance up to 4x slower than that of a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per-node performance up to 2.8x. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our pooling we can scale up to O(10^4). As our analysis indicates, it is feasible to observe much higher scalability in the near future.

Citations: 79

BAShuffler: Maximizing Network Bandwidth Utilization in the Shuffle of YARN

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294.2907296

Feng Liang, F. Lau

Citations: 8

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294

H. Nakashima, K. Taura, Jack Lange

Abstract: Welcome to the 25th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'16). HPDC'16 follows the tradition of previous editions of the conference by providing a high-quality, single-track forum for presenting new research results on all aspects of the design, implementation, evaluation, and application of parallel and distributed systems for high-end computing. The HPDC'16 program features eight sessions that cover a wide range of topics including high-performance networking, parallel algorithms, algorithm-based fault tolerance, big data processing, I/O optimizations, non-volatile memory, cloud, resource management, many-core systems, GPUs, graph processing algorithms, and more. In these sessions, both full papers and short papers are presented to give a mix of novel research directions at various stages of development, which is also exhibited by a number of posters. This program is complemented by an interesting set of six workshops, FTXS, HPGP, SEM4HPC, DIDC, ROSS and ScienceCloud, on a range of timely and related systems and application topics.

The conference program also features three keynote/invited talks given by Dr. Jeffrey Vetter of Oak Ridge National Laboratory, Professor Jack Dongarra of University of Tennessee, and Professor Ada Gavrilovska of Georgia Tech to memorialize the late Professor Karsten Schwan of Georgia Tech.

Jack Dongarra is the recipient of the 5th HPDC Annual Achievement Award. The purpose of this award is to recognize individuals who have made long-lasting, influential contributions to the foundations or practice of the field of high-performance parallel and distributed computing, to raise the awareness of these contributions, especially among the younger generation of researchers, and to improve the image and the public relations of the HPDC community. The Award Selection Committee followed the formalized process established in 2013 to select the winner with an open call for nominations.

The HPDC'16 call for papers attracted 129 paper submissions. In the review process this year, we followed two established methods that were started in 2012: a two-round review process and an author rebuttal process. In the first-round review, all papers received at least three reviews, and based on these reviews, 71 papers went on to the second round, in which most of them received another two reviews. In total, 514 reviews were generated by the 54-member Program Committee along with a number of external reviewers. For many of the 71 second-round papers, the authors submitted rebuttals. Rebuttals were carefully taken into consideration during the Program Committee deliberations as part of the selection process. On March 10-11, the Program Committee met at University of Pittsburgh (Pittsburgh, PA) and made the final selection. Each paper in the second round of reviews was discussed at the meeting. At the end of the 1.5-day meeting, the Program Committee accepted 20 full papers, resulting in an acce

Citations: 0

Network-Managed Virtual Global Address Space for Message-driven Runtimes

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294.2907320

Abhishek Kulkarni, Luke Dalessandro, E. Kissel, A. Lumsdaine, T. Sterling, M. Swany

Citations: 5

Session details: Parallel and Fault Tolerant algorithms

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/3257970

A. Butt

Citations: 0

Efficient Processing of Large Graphs via Input Reduction

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294.2907312

Amlan Kusum, Keval Vora, Rajiv Gupta, Iulian Neamtiu

Abstract: Large-scale parallel graph analytics involves executing iterative algorithms (e.g., PageRank, Shortest Paths, etc.) that are both data- and compute-intensive. In this work we construct faster versions of iterative graph algorithms from their original counterparts using input graph reduction. A large input graph is transformed into a small graph using a sequence of input reduction transformations. Savings in execution time are achieved using our two-phase processing model that effectively runs the original iterative algorithm in two phases: first, using the reduced input graph to gain savings in execution time; and second, using the original input graph along with the results from the first phase for computing precise results. We propose several input reduction transformations and identify the structural and non-structural properties that they guarantee, which in turn are used to ensure the correctness of results while using our two-phase processing model. We further present a unified input reduction algorithm that efficiently applies a non-interfering sequence of simple local input reduction transformations. Our experiments show that our transformation techniques enable significant reductions in execution time (1.25x-2.14x) while achieving precise final results for most of the algorithms. For cases where precise results cannot be achieved, the relative error remains very small (at most 0.065).

Citations: 27
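
The two-phase model described in the input-reduction abstract can be sketched on a toy single-source shortest-paths problem. The reduction used here (dropping edges that lead into vertices with no outgoing edges) is a simple stand-in chosen for this example and is not one of the paper's transformations; the function names are likewise hypothetical, and non-negative edge weights are assumed. Phase 1 relaxes on the reduced graph, and phase 2 relaxes on the original graph starting from the phase 1 distances.

```python
import math


def relax_until_stable(edges, dist):
    """Bellman-Ford style relaxation in place; returns the number of passes."""
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
    return passes


def initial_distances(edges, source):
    """Infinity everywhere except the source vertex."""
    vertices = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    dist = {v: math.inf for v in vertices}
    dist[source] = 0.0
    return dist


def two_phase_sssp(edges, source):
    # Illustrative reduction (an assumption of this sketch): sink vertices
    # never lie in the middle of a path, so edges into them can be dropped
    # without changing any other vertex's distance.
    has_out = {u for u, _, _ in edges}
    reduced = [(u, v, w) for u, v, w in edges if v in has_out]

    dist = initial_distances(edges, source)
    relax_until_stable(reduced, dist)  # phase 1: reduced input graph
    relax_until_stable(edges, dist)    # phase 2: original graph, warm start
    return dist


edges = [("s", "a", 1.0), ("a", "b", 2.0), ("s", "b", 5.0),
         ("b", "t", 1.0), ("a", "u", 7.0)]

baseline = initial_distances(edges, "s")
relax_until_stable(edges, baseline)
result = two_phase_sssp(edges, "s")
print(result == baseline)
```

With this particular reduction the second phase only has to fill in the dropped sink vertices, so the final distances match a run on the original graph from scratch.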

High-Performance Distributed RMA Locks

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294.2907323

P. Schmid, Maciej Besta, T. Hoefler

Abstract: We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three-dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.

Citations: 31

Evaluation of Pattern Matching Workloads in Graph Analysis Systems

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing Pub Date : 2016-05-31 DOI: 10.1145/2907294.2907305

Seokyong Hong, S. Lee, Seung-Hwan Lim, S. Sukumar, Ranga Raju Vatsavai

Citations: 4