Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing: Latest Papers

Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing
Oh-Kyoung Kwon, H. Ryu
{"title":"Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing","authors":"Oh-Kyoung Kwon, H. Ryu","doi":"10.5772/INTECHOPEN.80997","DOIUrl":"https://doi.org/10.5772/INTECHOPEN.80997","url":null,"abstract":"Large-scale electronic structure simulations coupled to an empirical modeling approach are critical as they present a robust way to predict various quantum phenomena in realistically sized nanoscale structures that are hard to be handled with density functional theory. For tight-binding (TB) simulations of electronic structures that normally involve multimillion atomic systems for a direct comparison to experimentally realizable nanoscale materials and devices, we show that graphical processing unit (GPU) devices help in saving computing costs in terms of time and energy consumption. With a short introduction of the major numerical method adopted for TB simulations of electronic structures, this work presents a detailed description for the strategies to drive performance enhancement with GPU devices against traditional clusters of multicore processors. While this work only uses TB electronic structure simulations for benchmark tests, it can be also utilized as a practical guideline to enhance performance of numerical operations that involve large-scale sparse matrices.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"36 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91479822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
Dingwen Tao, S. Song, S. Krishnamoorthy, Panruo Wu, Xin Liang, E. Zhang, D. Kerbyson, Zizhong Chen
{"title":"New-Sum: A Novel Online ABFT Scheme For General Iterative Methods","authors":"Dingwen Tao, S. Song, S. Krishnamoorthy, Panruo Wu, Xin Liang, E. Zhang, D. Kerbyson, Zizhong Chen","doi":"10.1145/2907294.2907306","DOIUrl":"https://doi.org/10.1145/2907294.2907306","url":null,"abstract":"Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from UFL Sparse Matrix Collection. 
The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73545348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41
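The New-Sum abstract above refers to checksum-based encoding for matrix-vector multiplication. As a rough illustration of the general idea only, the sketch below uses the classic column-checksum identity, sum(A·x) = (1ᵀA)·x, to detect a corrupted product. This is not the New-Sum scheme from the paper; the function names (`column_checksum`, `matvec`, `checked_matvec`) and the tolerance `tol` are hypothetical and chosen for this example.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def column_checksum(a: Matrix) -> Vector:
    """Return c = 1^T A, i.e. the vector of column sums of A."""
    return [sum(col) for col in zip(*a)]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Plain dense matrix-vector product y = A x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def checked_matvec(a: Matrix, c: Vector, x: Vector,
                   tol: float = 1e-9) -> Tuple[Vector, bool]:
    """Compute y = A x and verify sum(y) against the checksum c . x.

    Returns (y, ok). ok is False when the two sums disagree by more than
    a relative tolerance, which signals a possible soft error.
    """
    y = matvec(a, x)
    expected = sum(cj * xj for cj, xj in zip(c, x))
    actual = sum(y)
    scale = max(1.0, abs(expected), abs(actual))
    return y, abs(actual - expected) <= tol * scale


# The checksum is computed once, while the matrix is known to be good.
A = [[4.0, 1.0], [1.0, 3.0]]
c = column_checksum(A)
y_good, ok_good = checked_matvec(A, c, [1.0, 2.0])

# Simulate a memory error: one matrix entry changes after encoding.
A_bad = [[4.0, 2.0], [1.0, 3.0]]
y_bad, ok_bad = checked_matvec(A_bad, c, [1.0, 2.0])
```

In an iterative solver such a check would run alongside each matrix-vector product, with a rollback to a checkpoint when `ok` is False; how often to check and how to recover are design choices the abstract attributes to the paper's own schemes.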
Scaling Spark on HPC Systems
Nicholas Chaimov, A. Malony, S. Canon, Costin Iancu, K. Ibrahim, Jayanth Srinivasan
{"title":"Scaling Spark on HPC Systems","authors":"Nicholas Chaimov, A. Malony, S. Canon, Costin Iancu, K. Ibrahim, Jayanth Srinivasan","doi":"10.1145/2907294.2907310","DOIUrl":"https://doi.org/10.1145/2907294.2907310","url":null,"abstract":"We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in a HPC installation using Lustre: it determines single node performance up to 4x slower than a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per node performance up to 2.8x. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in a HPC installation with Lustre and default Spark. After careful configuration combined with our pooling we can scale up to O(10^4). As our analysis indicates, it is feasible to observe much higher scalability in the near future.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80865722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 79
BAShuffler: Maximizing Network Bandwidth Utilization in the Shuffle of YARN
Feng Liang, F. Lau
{"title":"BAShuffler: Maximizing Network Bandwidth Utilization in the Shuffle of YARN","authors":"Feng Liang, F. Lau","doi":"10.1145/2907294.2907296","DOIUrl":"https://doi.org/10.1145/2907294.2907296","url":null,"abstract":"YARN is a popular cluster resource management platform. It does not, however, manage the network bandwidth resources which can significantly affect the execution performance of those tasks having large volumes of data to transfer within the cluster. The shuffle phase of MapReduce jobs features many such tasks. The impact of under utilization of the network bandwidth in shuffle tasks is more pronounced if the network bandwidth capacities of the nodes in the cluster are varied. We present BAShuffler, a bandwidth-aware shuffle scheduler, that can maximize the overall network bandwidth utilization by scheduling the source nodes of the fetch flows at the application level. BAShuffler can fully utilize the network bandwidth capacity in a max-min fair network. The experimental results for a variety of realistic benchmarks show that BAShuffler can substantially improve the cluster's shuffle throughput and reduce the execution time of shuffle tasks as compared to the original YARN, especially in heterogeneous network bandwidth environments.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88893093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
H. Nakashima, K. Taura, Jack Lange
{"title":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","authors":"H. Nakashima, K. Taura, Jack Lange","doi":"10.1145/2907294","DOIUrl":"https://doi.org/10.1145/2907294","url":null,"abstract":"Welcome to the 25th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'16). HPDC'16 follows the tradition of previous versions of the conference by providing a high-quality, single-track forum for presenting new research results on all aspects of the design, implementation, evaluation, and application of parallel and distributed systems for high-end computing. The HPDC'16 program features eight sessions that cover wide range of topics including high performance networking, parallel algorithms, algorithm-based fault tolerance, big data processing, I/O optimizations, non-volatile memory, cloud, resource management, many core systems, GPUs, graph processing algorithms, and more. In these sessions, not only full papers but also short papers are presented to give a mix of novel research directions at various stages of development, which also is exhibited by a number of posters. This program is complemented by an interesting set of six workshops, FTXS, HPGP, SEM4HPC, DIDC, ROSS and ScienceCloud, on a range of timely and related systems and application topics. The conference program also features three keynote/invited talks given by Dr. Jeffrey Vetter of Oak Ridge National Laboratory, Professor Jack Dongarra of University of Tennessee, and Professor Ada Gavrilovska of Georgia Tech to memorialize the late Professor Karsten Schwan of Georgia Tech. Jack Dongarra is the recipient of the 5th HPDC Annual Achievement Award. 
The purpose of this award is to recognize individuals who have made long lasting, influential contributions to the foundations or practice of the field of high-performance parallel and distributed computing, to raise the awareness of these contributions, especially among the younger generation of researchers, and to improve the image and the public relations of the HPDC community. The Award Selection Committee followed the formalized process established in 2013 to select the winner with an open call for nominations. The HPDC'16 call for papers attracted 129 paper submissions. In the review process this year, we followed two established methods that were started in 2012: a two-round review process and an author rebuttal process. In the first round review, all papers received at least three reviews, and based on these reviews, 71 papers went on to the second round in which most of them received another two reviews. In total, 514 reviews were generated by the 54-member Program Committee along with a number of external reviewers. For many of the 71 second-round papers, the authors submitted rebuttals. Rebuttals were carefully taken into consideration during the Program Committee deliberations as part of the selection process. On March 10-11, the Program Committee met at University of Pittsburgh (Pittsburgh, PA) and made the final selection. Each paper in the second round of reviews was discussed at the meeting. 
At the end of the 1.5-day meeting, the Program Committee accepted 20 full papers, resulting in an acce","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87384861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Network-Managed Virtual Global Address Space for Message-driven Runtimes
Abhishek Kulkarni, Luke Dalessandro, E. Kissel, A. Lumsdaine, T. Sterling, M. Swany
{"title":"Network-Managed Virtual Global Address Space for Message-driven Runtimes","authors":"Abhishek Kulkarni, Luke Dalessandro, E. Kissel, A. Lumsdaine, T. Sterling, M. Swany","doi":"10.1145/2907294.2907320","DOIUrl":"https://doi.org/10.1145/2907294.2907320","url":null,"abstract":"Maintaining a scalable high-performance virtual global address space using distributed memory hardware has proven to be challenging. In this paper we evaluate a new approach for such an active global address space that leverages the capabilities of the network fabric to manage addressing, rather than software at the endpoint hosts. We describe our overall approach, design alternatives, and present initial experimental results that demonstrate the effectiveness and limitations of existing network hardware.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86659913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
High-Performance Distributed RMA Locks
P. Schmid, Maciej Besta, T. Hoefler
{"title":"High-Performance Distributed RMA Locks","authors":"P. Schmid, Maciej Besta, T. Hoefler","doi":"10.1145/2907294.2907323","DOIUrl":"https://doi.org/10.1145/2907294.2907323","url":null,"abstract":"We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. 
Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89445110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
Efficient Processing of Large Graphs via Input Reduction
Amlan Kusum, Keval Vora, Rajiv Gupta, Iulian Neamtiu
{"title":"Efficient Processing of Large Graphs via Input Reduction","authors":"Amlan Kusum, Keval Vora, Rajiv Gupta, Iulian Neamtiu","doi":"10.1145/2907294.2907312","DOIUrl":"https://doi.org/10.1145/2907294.2907312","url":null,"abstract":"Large-scale parallel graph analytics involves executing iterative algorithms (e.g., PageRank, Shortest Paths, etc.) that are both data- and compute-intensive. In this work we construct faster versions of iterative graph algorithms from their original counterparts using input graph reduction. A large input graph is transformed into a small graph using a sequence of input reduction transformations. Savings in execution time are achieved using our two phased processing model that effectively runs the original iterative algorithm in two phases: first, using the reduced input graph to gain savings in execution time; and second, using the original input graph along with the results from the first phase for computing precise results. We propose several input reduction transformations and identify the structural and non-structural properties that they guarantee, which in turn are used to ensure the correctness of results while using our two phased processing model. We further present a unified input reduction algorithm that efficiently applies a non-interfering sequence of simple local input reduction transformations. Our experiments show that our transformation techniques enable significant reductions in execution time (1.25x-2.14x) while achieving precise final results for most of the algorithms. 
For cases where precise results cannot be achieved, the relative error remains very small (at most 0.065).","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75144414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 27
Session details: Parallel and Fault Tolerant algorithms
A. Butt
{"title":"Session details: Parallel and Fault Tolerant algorithms","authors":"A. Butt","doi":"10.1145/3257970","DOIUrl":"https://doi.org/10.1145/3257970","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72730304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Evaluation of Pattern Matching Workloads in Graph Analysis Systems
Seokyong Hong, S. Lee, Seung-Hwan Lim, S. Sukumar, Ranga Raju Vatsavai
{"title":"Evaluation of Pattern Matching Workloads in Graph Analysis Systems","authors":"Seokyong Hong, S. Lee, Seung-Hwan Lim, S. Sukumar, Ranga Raju Vatsavai","doi":"10.1145/2907294.2907305","DOIUrl":"https://doi.org/10.1145/2907294.2907305","url":null,"abstract":"Graph data management and mining became a popular area of research, and led to the development of plethora of systems in recent years. Unfortunately, a number of emerging graph analysis systems assume different graph data models, and support different query interface and serialization formats. Such diversity, combined with a lack of comparisons, makes it complicated to understand the trade-offs between different systems and the graph operations for which they are designed. This study presents an evaluation of graph pattern matching capabilities of six graph analysis systems, by extending the Lehigh University Benchmark to investigate the degree of effectiveness to perform the same operation over the same graph in various graph analysis systems. Through the evaluation, this study reveals both quantitative and qualitative findings.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72966245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4