{"title":"Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing","authors":"Oh-Kyoung Kwon, H. Ryu","doi":"10.5772/INTECHOPEN.80997","DOIUrl":"https://doi.org/10.5772/INTECHOPEN.80997","url":null,"abstract":"Large-scale electronic structure simulations coupled to an empirical modeling approach are critical as they present a robust way to predict various quantum phenomena in realistically sized nanoscale structures that are hard to handle with density functional theory. For tight-binding (TB) simulations of electronic structures that normally involve multimillion atomic systems for a direct comparison to experimentally realizable nanoscale materials and devices, we show that graphical processing unit (GPU) devices help in saving computing costs in terms of time and energy consumption. With a short introduction of the major numerical method adopted for TB simulations of electronic structures, this work presents a detailed description of the strategies to drive performance enhancement with GPU devices against traditional clusters of multicore processors. While this work only uses TB electronic structure simulations for benchmark tests, it can also be used as a practical guideline to enhance the performance of numerical operations that involve large-scale sparse matrices.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"36 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91479822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New-Sum: A Novel Online ABFT Scheme For General Iterative Methods","authors":"Dingwen Tao, S. Song, S. Krishnamoorthy, Panruo Wu, Xin Liang, E. Zhang, D. Kerbyson, Zizhong Chen","doi":"10.1145/2907294.2907306","DOIUrl":"https://doi.org/10.1145/2907294.2907306","url":null,"abstract":"Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from UFL Sparse Matrix Collection. 
The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73545348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Spark on HPC Systems","authors":"Nicholas Chaimov, A. Malony, S. Canon, Costin Iancu, K. Ibrahim, Jayanth Srinivasan","doi":"10.1145/2907294.2907310","DOIUrl":"https://doi.org/10.1145/2907294.2907310","url":null,"abstract":"We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it determines single node performance up to 4x slower than a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per node performance up to 2.8x. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our pooling we can scale up to O(10^4). As our analysis indicates, it is feasible to observe much higher scalability in the near future.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80865722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BAShuffler: Maximizing Network Bandwidth Utilization in the Shuffle of YARN","authors":"Feng Liang, F. Lau","doi":"10.1145/2907294.2907296","DOIUrl":"https://doi.org/10.1145/2907294.2907296","url":null,"abstract":"YARN is a popular cluster resource management platform. It does not, however, manage the network bandwidth resources which can significantly affect the execution performance of those tasks having large volumes of data to transfer within the cluster. The shuffle phase of MapReduce jobs features many such tasks. The impact of underutilization of the network bandwidth in shuffle tasks is more pronounced if the network bandwidth capacities of the nodes in the cluster are varied. We present BAShuffler, a bandwidth-aware shuffle scheduler, that can maximize the overall network bandwidth utilization by scheduling the source nodes of the fetch flows at the application level. BAShuffler can fully utilize the network bandwidth capacity in a max-min fair network. The experimental results for a variety of realistic benchmarks show that BAShuffler can substantially improve the cluster's shuffle throughput and reduce the execution time of shuffle tasks as compared to the original YARN, especially in heterogeneous network bandwidth environments.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88893093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","authors":"H. Nakashima, K. Taura, Jack Lange","doi":"10.1145/2907294","DOIUrl":"https://doi.org/10.1145/2907294","url":null,"abstract":"Welcome to the 25th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'16). HPDC'16 follows the tradition of previous versions of the conference by providing a high-quality, single-track forum for presenting new research results on all aspects of the design, implementation, evaluation, and application of parallel and distributed systems for high-end computing. The HPDC'16 program features eight sessions that cover a wide range of topics including high performance networking, parallel algorithms, algorithm-based fault tolerance, big data processing, I/O optimizations, non-volatile memory, cloud, resource management, many core systems, GPUs, graph processing algorithms, and more. In these sessions, not only full papers but also short papers are presented to give a mix of novel research directions at various stages of development, which also is exhibited by a number of posters. This program is complemented by an interesting set of six workshops, FTXS, HPGP, SEM4HPC, DIDC, ROSS and ScienceCloud, on a range of timely and related systems and application topics. The conference program also features three keynote/invited talks given by Dr. Jeffrey Vetter of Oak Ridge National Laboratory, Professor Jack Dongarra of University of Tennessee, and Professor Ada Gavrilovska of Georgia Tech to memorialize the late Professor Karsten Schwan of Georgia Tech. Jack Dongarra is the recipient of the 5th HPDC Annual Achievement Award. 
The purpose of this award is to recognize individuals who have made long lasting, influential contributions to the foundations or practice of the field of high-performance parallel and distributed computing, to raise the awareness of these contributions, especially among the younger generation of researchers, and to improve the image and the public relations of the HPDC community. The Award Selection Committee followed the formalized process established in 2013 to select the winner with an open call for nominations. The HPDC'16 call for papers attracted 129 paper submissions. In the review process this year, we followed two established methods that were started in 2012: a two-round review process and an author rebuttal process. In the first round review, all papers received at least three reviews, and based on these reviews, 71 papers went on to the second round in which most of them received another two reviews. In total, 514 reviews were generated by the 54-member Program Committee along with a number of external reviewers. For many of the 71 second-round papers, the authors submitted rebuttals. Rebuttals were carefully taken into consideration during the Program Committee deliberations as part of the selection process. On March 10-11, the Program Committee met at University of Pittsburgh (Pittsburgh, PA) and made the final selection. Each paper in the second round of reviews was discussed at the meeting. 
At the end of the 1.5-day meeting, the Program Committee accepted 20 full papers, resulting in an acce","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87384861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network-Managed Virtual Global Address Space for Message-driven Runtimes","authors":"Abhishek Kulkarni, Luke Dalessandro, E. Kissel, A. Lumsdaine, T. Sterling, M. Swany","doi":"10.1145/2907294.2907320","DOIUrl":"https://doi.org/10.1145/2907294.2907320","url":null,"abstract":"Maintaining a scalable high-performance virtual global address space using distributed memory hardware has proven to be challenging. In this paper we evaluate a new approach for such an active global address space that leverages the capabilities of the network fabric to manage addressing, rather than software at the endpoint hosts. We describe our overall approach, design alternatives, and present initial experimental results that demonstrate the effectiveness and limitations of existing network hardware.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86659913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Distributed RMA Locks","authors":"P. Schmid, Maciej Besta, T. Hoefler","doi":"10.1145/2907294.2907323","DOIUrl":"https://doi.org/10.1145/2907294.2907323","url":null,"abstract":"We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that is an interplay of three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and illustrate that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. 
Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89445110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Processing of Large Graphs via Input Reduction","authors":"Amlan Kusum, Keval Vora, Rajiv Gupta, Iulian Neamtiu","doi":"10.1145/2907294.2907312","DOIUrl":"https://doi.org/10.1145/2907294.2907312","url":null,"abstract":"Large-scale parallel graph analytics involves executing iterative algorithms (e.g., PageRank, Shortest Paths, etc.) that are both data- and compute-intensive. In this work we construct faster versions of iterative graph algorithms from their original counterparts using input graph reduction. A large input graph is transformed into a small graph using a sequence of input reduction transformations. Savings in execution time are achieved using our two phased processing model that effectively runs the original iterative algorithm in two phases: first, using the reduced input graph to gain savings in execution time; and second, using the original input graph along with the results from the first phase for computing precise results. We propose several input reduction transformations and identify the structural and non-structural properties that they guarantee, which in turn are used to ensure the correctness of results while using our two phased processing model. We further present a unified input reduction algorithm that efficiently applies a non-interfering sequence of simple local input reduction transformations. Our experiments show that our transformation techniques enable significant reductions in execution time (1.25x-2.14x) while achieving precise final results for most of the algorithms. 
For cases where precise results cannot be achieved, the relative error remains very small (at most 0.065).","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75144414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Parallel and Fault Tolerant algorithms","authors":"A. Butt","doi":"10.1145/3257970","DOIUrl":"https://doi.org/10.1145/3257970","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72730304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Pattern Matching Workloads in Graph Analysis Systems","authors":"Seokyong Hong, S. Lee, Seung-Hwan Lim, S. Sukumar, Ranga Raju Vatsavai","doi":"10.1145/2907294.2907305","DOIUrl":"https://doi.org/10.1145/2907294.2907305","url":null,"abstract":"Graph data management and mining has become a popular area of research and has led to the development of a plethora of systems in recent years. Unfortunately, a number of emerging graph analysis systems assume different graph data models, and support different query interfaces and serialization formats. Such diversity, combined with a lack of comparisons, makes it complicated to understand the trade-offs between different systems and the graph operations for which they are designed. This study presents an evaluation of the graph pattern matching capabilities of six graph analysis systems, extending the Lehigh University Benchmark to investigate how effectively various graph analysis systems perform the same operation over the same graph. Through the evaluation, this study reveals both quantitative and qualitative findings.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72966245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}