{"title":"Towards Fast Scalable Solvers for Charge Equilibration in Molecular Dynamics Applications","authors":"Kurt A. O'Hearn, H. Aktulga","doi":"10.1109/SCALA.2016.6","DOIUrl":"https://doi.org/10.1109/SCALA.2016.6","url":null,"abstract":"Including atom polarizability in molecular dynamics (MD) simulations is important for high-fidelity simulations. Solvers for charge models that are used to dynamically determine atom polarizations constitute significant bottlenecks in terms of time-to-solution and the overall scalability of polarizable and reactive force fields. The objective of this work is to improve the performance of the charge equilibration (QEq) method on shared memory architectures. A number of parallel incomplete LU-based preconditioning techniques are explored to enhance the performance of the Krylov subspace methods used in the QEq model. Detailed analysis of how these techniques effect convergence rate and the overall solver performance is presented. ILU-based schemes which produce good quality factors with relatively low number of nonzeros have been observed to yield significant speedups over the diagonal inverse baseline preconditioner. These results are significant as they can enable efficient simulations of moderate-sized systems on a single node with several cores, and also because they can constitute the future building blocks for distributed memory parallel solvers.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116071936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing PLASMA Eigensolver on Large Shared Memory Systems","authors":"Cheng Liao","doi":"10.1109/SCALA.2016.14","DOIUrl":"https://doi.org/10.1109/SCALA.2016.14","url":null,"abstract":"Performance of the PLASMA dense symmetric Eigensolver is optimized for large shared memory computer systems using multiple Householder domains for dense to band reduction and a communication reducing kernel for bulge chasing. The mr3-smp code by Petschow and Bientinesi is used for the tridiagonal eigensolution and the eigenvector back-transformations employ a 1D parallel decomposition. The input matrix, Householder vectors and scalars, are distributed among the CPU sockets with interleaved memory pages but the banded matrix, the eigenvectors, and temporary memory buffers are allocated and processed locally. Other considerations and optimization techniques also are presented. Numerical examples show the PLASMA eigensolver can out-perform ELPA and EIGENEXA significantly, for solving all the eigenpairs, if the problem size is sufficiently large, and the 2-stage eigensolution is generally better than its 1-stage counterpart on the latest x86_64 EP-4S CPUs with AVX2.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114633059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Left-Preconditioned Communication-Avoiding Conjugate Gradient Methods for Multiphase CFD Simulations on the K Computer","authors":"Akie Mayumi, Y. Idomura, Takuya Ina, S. Yamada, Toshiyuki Imamura","doi":"10.1109/SCALA.2016.7","DOIUrl":"https://doi.org/10.1109/SCALA.2016.7","url":null,"abstract":"The left-preconditioned communication avoiding conjugate gradient (LP-CA-CG) method is applied to the pressure Poisson equation in the multiphase CFD code JUPITER. The arithmetic intensity of the LP-CA-CG method is analyzed, and is dramatically improved by loop splitting for inner product operations and for three term recurrence operations. Two LPCA-CG solvers with block Jacobi preconditioning and with underlap preconditioning are developed. The former is developed based on a hybrid CA approach, in which CA is applied only to global collective communications for inner product operations. The latter is a full CA approach, in which CA is applied also to local point-to-point communications in sparse matrix-vector (SpMV) operations and preconditioning. CA-SpMV requires additional computation for overlapping regions. CA-preconditiong is enabled by underlap preconditioning, which approximates preconditioning for overlapping regions by point Jacobi preconditioning. It is shown that on the K computer, the former is faster, because the performance of local point-to-point communications scales well, and the convergence property becomes worse with underlap preconditioning. The LP-CA-CG solver shows good strong scaling up to 30,000 nodes, where the LP-CA-CG solver achieved higher performance than the original CG solver by reducing the cost of global collective communications by 69 percent.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116507294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Scaling Variability and Energy Analysis for a Resilient ULFM-based PDE Solver","authors":"Karla Morris, F. Rizzi, Brendan Cook, Paul Mycek, O. Maître, O. Knio, K. Sargsyan, K. Dahlgren, B. Debusschere","doi":"10.1109/SCALA.2016.10","DOIUrl":"https://doi.org/10.1109/SCALA.2016.10","url":null,"abstract":"We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs) built on top of User Level Fault Mitigation Message Passing Interface (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model where all state information is held by the servers, while clients only serve as computational units. The task-based nature of the algorithm and the capabilities of ULFM complement each other to support missing tasks, making the application resilient to clients failing.We present weak and strong scaling results on Edison, National Energy Research Scientific Computing Center (NERSC), for a nominal and a fault-injected case, showing that even in the presence of faults, scalability tested up to 50k cores is within 90%. We then quantify the variability of weak and strong scaling due to the presence of faults. Finally, we discuss the performance of our application with respect to subdomain size, server/client configuration, and the interplay between energy and resilience.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132361546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Monte Carlo Hybrid Methods for Linear Algebra","authors":"Diego Davila, V. Alexandrov, Oscar A. Esquivel-Flores","doi":"10.1109/SCALA.2016.15","DOIUrl":"https://doi.org/10.1109/SCALA.2016.15","url":null,"abstract":"This paper presents an enhanced hybrid (e.g. stochastic/deterministic) method for Linear Algebra based on bulding an efficient stochastic s and then solving the corresponding System of Linear Algebraic Equations (SLAE) by applying an iterative method. This is a Monte Carlo preconditioner based on Markov Chain Monte Carlo (MCMC) methods to compute a rough approximate matrix inverse first. The above Monte Carlo preconditioner is further used to solve systems of linear algebraic equations thus delivering hybrid stochastic/deterministic algorithms. The advantage of the proposed approach is that the sparse Monte Carlo matrix inversion has a computational complexity linear of the size of the matrix, it is inherently parallel and thus can be obtained very efficiently for large matrices and can be used also as an efficient preconditioner while solving systems of linear algebraic equations. Several improvements, as well as the mixed MPI/OpenMP implementation, are carried out that enhance the scalability of the method and the efficient use of computational resources. A set of different test matrices from several matrix market collections were used to show the consistency of these improvements.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129163909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Batched Generation of Incomplete Sparse Approximate Inverses on GPUs","authors":"H. Anzt, Edmond Chow, T. Huckle, J. Dongarra","doi":"10.1109/SCALA.2016.11","DOIUrl":"https://doi.org/10.1109/SCALA.2016.11","url":null,"abstract":"Incomplete Sparse Approximate Inverses (ISAI) have recently been shown to be an attractive alternative to exact sparse triangular solves in the context of incomplete factorization preconditioning. In this paper we propose a batched GPU-kernel for the efficient generation of ISAI matrices. Utilizing only thread-local memory allows for computing the ISAI matrix with very small memory footprint. We demonstrate that this strategy is faster than the existing strategy for generating ISAI matrices, and use a large number of test matrices to assess the algorithm's efficiency in an iterative solver setting.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128231627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Randomized Sketching for Large-Scale Sparse Ridge Regression Problems","authors":"Chander Iyer, C. Carothers, P. Drineas","doi":"10.1109/SCALA.2016.13","DOIUrl":"https://doi.org/10.1109/SCALA.2016.13","url":null,"abstract":"We present a fast randomized ridge regression solver for sparse overdetermined matrices in distributed-memory platforms. Our solver is based on the Blendenpik algorithm, but employs sparse random projection schemes to construct a sketch of the input matrix. These sparse random projection sketching schemes, and in particular the use of the Randomized Sparsity-Preserving Transform, enable our algorithm to scale the distributed memory vanilla implementation of Blendenpik and provide up to × 13 speedup over a state-of-the-art parallel Cholesky-like sparse-direct solver.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"707 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115126031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Gyrokinetic Particle Simulation of Fusion Plasmas on Tianhe-2 Supercomputer","authors":"Endong Wang, Shaohua Wu, Qing Zhang, Jun Liu, Wenlu Zhang, Zhihong Lin, Yutong Lu, Yunfei Du, Xiaoqian Zhu","doi":"10.1109/SCALA.2016.8","DOIUrl":"https://doi.org/10.1109/SCALA.2016.8","url":null,"abstract":"We present novel optimizations of the fusion plasmas simulation code, GTC on Tianhe-2 supercomputer. The simulation exhibits excellent weak scalability up to 3072 31S1P Xeon Phi co-processors. An unprecedented up to 5.8× performance improvement is achieved for the GTC on Tianhe-2. An efficient particle exchanging algorithm is developed that simplifies the original iterative scheme to a direct implementation, which leads to a 7.9× performance improvement in terms of MPI communications on 1024 nodes of Tianhe-2. A customized particle sorting algorithm is presented that delivers a 2.0× performance improvement on the co-processor for the kernel relating to the particle computing. A smart offload algorithm that minimizes the data exchange between host and co-processor is introduced. Other optimizations like the loop fusion and vectorization are also presented.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121223971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective Dynamic Load Balance using Space-Filling Curves for Large-Scale SPH Simulations on GPU-rich Supercomputers","authors":"Satori Tsuzuki, T. Aoki","doi":"10.1109/SCALA.2016.5","DOIUrl":"https://doi.org/10.1109/SCALA.2016.5","url":null,"abstract":"Billion of particles are required to describe fluid dynamics by using smoothed particle hydrodynamics (SPH), which computes short-range interactions among particles. In this study, we develop a novel code of large-scale SPH simulations on a multi-GPU platform by using the domain decomposition technique. The computational load of each decomposed domain is dynamically balanced by applying domain re-decomposition, which maintains the same number of particles in each decomposed domain. The performance scalability of the SPH simulation is examined on the GPUs of a TSUBAME 2.5 supercomputer by using two different techniques of dynamic load balance: the slice-grid method and the hierarchical domain decomposition method using the space-filling curve. The weak and strong scalabilities of a test case using 111 million particles are measured with 512 GPUs. In comparison with the slice-grid method, the performance keeps improving in proportion to the number of GPUs in the case of the space-filling curve. The Hilbert curve and the Peano curve show better performance scalabilities than the Morton curve in proportion to the increase in the number of GPUs.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122750530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Massively Parallel Distributed N-body Application Implemented with HPX","authors":"Zahra Khatami, Hartmut Kaiser, Patricia A. Grubel, Adrian Serio, J. Ramanujam","doi":"10.1109/SCALA.2016.12","DOIUrl":"https://doi.org/10.1109/SCALA.2016.12","url":null,"abstract":"One of the major challenges in parallelization is the difficulty of improving application scalability with conventional techniques. HPX provides efficient scalable parallelism by significantly reducing node starvation and effective latencies while controlling the overheads. In this paper, we present a new highly scalable parallel distributed N-Body application using a future-based algorithm, which is implemented with HPX. The main difference between this algorithm and prior art is that a future-based request buffer is used between different nodes and along each spatial direction to send/receive data to/from the remote nodes, which helps removing synchronization barriers. HPX provides an asynchronous programming model which results in improving the parallel performance. The results of using HPX for parallelizing Octree construction on one node and the force computation on the distributed nodes show the scalability improvement on an average by about 45% compared to an equivalent OpenMP implementation and 28% compared to a hybrid implementation (MPI+OpenMP) [1] respectively for one billion particles running on up to 128 nodes with 20 cores per each.","PeriodicalId":410521,"journal":{"name":"2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126684461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}