{"title":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","authors":"V. Alexandrov, A. Geist, J. Dongarra","doi":"10.1145/3148226","DOIUrl":"https://doi.org/10.1145/3148226","url":null,"abstract":"Novel scalable scientific algorithms are needed in order to enable key science applications to exploit the computational power of large-scale systems. This is especially true for the current tier of leading petascale machines and the road to exascale computing as HPC systems continue to scale up in compute node and processor core count. These extreme-scale systems require novel scientific algorithms to hide network and memory latency, have very high computation/communication overlap, have minimal communication, and have no synchronization points. With the advent of Big Data in the past few years the need of such scalable mathematical methods and algorithms able to handle data and compute intensive applications at scale becomes even more important. \u0000 \u0000Scientific algorithms for multi-petaflop and exa-flop systems also need to be fault tolerant and fault resilient, since the probability of faults increases with scale. Resilience at the system software and at the algorithmic level is needed as a crosscutting effort. Finally, with the advent of heterogeneous compute nodes that employ standard processors as well as GPGPUs, scientific algorithms need to match these architectures to extract the most performance. This includes different system-specific levels of parallelism as well as co-scheduling of computation. Key science applications require novel mathematics and mathematical models and system software that address the scalability and resilience challenges of current- and future-generation extreme-scale HPC systems. \u0000 \u0000The goal of this workshop is to bring together experts in the area of scalable algorithms to present the latest achievements and to discuss the challenges ahead.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116403653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional eulerian code on many core platforms","authors":"Y. Idomura, Takuya Ina, Akie Mayumi, S. Yamada, Kazuya Matsumoto, Y. Asahi, Toshiyuki Imamura","doi":"10.1145/3148226.3148234","DOIUrl":"https://doi.org/10.1145/3148226.3148234","url":null,"abstract":"A communication-avoiding generalized minimal residual (CA-GMRES) method is applied to the gyrokinetic toroidal five dimensional Eulerian code GT5D, and its performance is compared against the original code with a generalized conjugate residual (GCR) method on the JAEA ICEX (Haswell), the Plasma Simulator (FX100), and the Oakforest-PACS (KNL). Although the CA-GMRES method dramatically reduces the number of data reduction communications, computation is largely increased compared with the GCR method. To resolve this issue, we propose a modified CA-GMRES method, which reduces both computation and memory access by ~ 30% with keeping the same CA property as the original CA-GMRES method. The modified CA-GMRES method has ~ 3.8X higher arithmetic intensity than the GCR method, and thus, is suitable for future Exa-scale architectures with limited memory and network bandwidths. The CA-GMRES solver is implemented using a hybrid CA approach, in which we apply CA to data reduction communications and use communication overlap for halo data communications, and is highly optimized for distributed caches on KNL. It is shown that compared with the GCR solver, its computing kernels are accelerated by 1.47X ~ 2.39X, and the cost of data reduction communication is reduced from 5% ~ 13% to ~ 1% of the total cost at 1,280 nodes.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131096460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible batched sparse matrix-vector product on GPUs","authors":"H. Anzt, Gary Collins, J. Dongarra, Goran Flegar, E. S. Quintana‐Ortí","doi":"10.1145/3148226.3148230","DOIUrl":"https://doi.org/10.1145/3148226.3148230","url":null,"abstract":"We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices which differ in size, nonzero count, and nonzero distribution. Furthermore, they support three most commonly used sparse storage formats: CSR, COO and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25X compared to non-batched SpMV routines.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127895857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating half precision arithmetic to accelerate dense linear system solvers","authors":"A. Haidar, Panruo Wu, S. Tomov, J. Dongarra","doi":"10.1145/3148226.3148237","DOIUrl":"https://doi.org/10.1145/3148226.3148237","url":null,"abstract":"The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today's powerful manycore GPU accelerators, e.g., like the NVIDIA V100, that can provide 120 TeraFLOPS alone in FP16. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique - we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly-tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for a first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance, and limitations of the approach.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114975116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging NVLINK and asynchronous data transfer to scale beyond the memory capacity of GPUs","authors":"D. Appelhans, B. Walkup","doi":"10.1145/3148226.3148232","DOIUrl":"https://doi.org/10.1145/3148226.3148232","url":null,"abstract":"In this paper we demonstrate the utility of fast GPU to CPU interconnects to weak scale on hierarchical nodes without being limited to problem sizes that fit only in the GPU memory capacity. We show the speedup possible for a new regime of algorithms which traditionally have not benefited from being ported to GPUs because of an insufficient amount of computational work relative to bytes of data that must be transferred (offload intensity). This new capability is demonstrated with an example of our hierarchical GPU port of UMT, the 51K line CORAL benchmark application for Lawrence Livermore National Lab's radiation transport code. By overlapping data transfers and using the NVLINK connection between IBM POWER 8 CPUs and NVIDIA P100 GPUs, we demonstrate a speedup that continues even when scaling the problem size well beyond the memory capacity of the GPUs. Scaling to large local domains per MPI process is a necessary step to solving very large problems, and in the case of UMT, large local domains improve the convergence as the number of MPI ranks are weak scaled.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132630986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing the criticality of transient faults-induced SDCS on GPU applications","authors":"F. Santos, P. Rech","doi":"10.1145/3148226.3148228","DOIUrl":"https://doi.org/10.1145/3148226.3148228","url":null,"abstract":"In this paper we compare the soft-error sensitivity of parallel applications on modern Graphics Processing Units (GPUs) obtained through architectural-level fault injections and high-energy particle beam radiation experiments. Fault-injection and beam experiments provide different information and uses different transient-fault sensitivity metrics, which are hard to combine. In this paper we show how correlating beam and fault-injection data can provide a deeper understanding of the behavior of GPUs in the occurrence of transient faults. In particular, we demonstrate that commonly used architecture-level fault models (and fast injection tools) can be used to identify critical kernels and to associate some experimentally observed output errors with their causes. Additionally, we show how register file and instruction-level injections can be used to evaluate ECC efficiency in reducing the radiation-induced error rate.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123652656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations","authors":"M. Obersteiner, A. Parra-Hinojosa, M. Heene, H. Bungartz, D. Pflüger","doi":"10.1145/3148226.3148229","DOIUrl":"https://doi.org/10.1145/3148226.3148229","url":null,"abstract":"With future exascale computers expected to have millions of compute units distributed among thousands of nodes, system faults are predicted to become more frequent. Fault tolerance will thus play a key role in HPC at this scale. In this paper we focus on solving the 5-dimensional gyrokinetic Vlasov-Maxwell equations using the application code GENE as it represents a high-dimensional and resource-intensive problem which is a natural candidate for exascale computing. We discuss the Fault-Tolerant Combination Technique, a resilient version of the Combination Technique, a method to increase the discretization resolution of existing PDE solvers. For the first time, we present an efficient, scalable and fault-tolerant implementation of this algorithm for plasma physics simulations based on a manager-worker model and test it under very realistic and pessimistic environments with simulated faults. We show that the Fault-Tolerant Combination Technique - an algorithm-based forward recovery method - can tolerate a large number of faults with a low overhead and at an acceptable loss in accuracy. Our parallel experiments with up to 32k cores show good scalability at a relative parallel efficiency of 93.61%. We conclude that algorithm-based solutions to fault tolerance are attractive for this type of problems.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129451962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic load balancing of massively parallel unstructured meshes","authors":"Gerrett Diamond, Cameron W. Smith, M. Shephard","doi":"10.1145/3148226.3148236","DOIUrl":"https://doi.org/10.1145/3148226.3148236","url":null,"abstract":"Simulating systems with evolving relational structures on massively parallel computers require the computational work to be evenly distributed across the processing resources throughout the simulation. Adaptive, unstructured, mesh-based finite element and finite volume tools best exemplify this need. We present EnGPar and its diffusive partition improvement method that accounts for multiple application specified criteria. EnGPar's performance is compared against its predecessor, ParMA. Specifically, partition improvement results are provided on up to 512Ki processes of the Argonne Leadership Computing Facility's Mira BlueGene/Q system.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115777927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic task discovery in PaRSEC: a data-flow task-based runtime","authors":"Reazul Hoque, T. Hérault, G. Bosilca, J. Dongarra","doi":"10.1145/3148226.3148233","DOIUrl":"https://doi.org/10.1145/3148226.3148233","url":null,"abstract":"Successfully exploiting distributed collections of heterogeneous many-cores architectures with complex memory hierarchy through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intra-node features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute to both local and distributed data dependencies management. The solution explored in this paper, PaRSEC, is based on such a programming paradigm, supported by a highly efficient task-based runtime. This paper compares two programming paradigms present in PaRSEC, Parameterized Task Graph (PTG) and Dynamic Task Discovery (DTD) in terms of capabilities, overhead and potential benefits.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130413269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel jaccard and related graph clustering techniques","authors":"Alexandre Fender, N. Emad, S. Petiton, Joe Eaton, M. Naumov","doi":"10.1145/3148226.3148231","DOIUrl":"https://doi.org/10.1145/3148226.3148231","url":null,"abstract":"In this paper we propose to generalize Jaccard and related measures, often used as similarity coefficients between two sets. We define Jaccard, Dice-Sorensen and Tversky edge weights on a graph and generalize them to account for vertex weights. We develop an efficient parallel algorithm for computing Jaccard edge and PageRank vertex weights. We highlight that the weights computation can obtain more than 10X speedup on the GPU versus CPU on large realistic data sets. Also, we show that finding a minimum balanced cut for modified weights can be related to minimizing the sum of ratios of the intersection and union of nodes on the boundary of clusters. Finally, we show that the novel weights can improve the quality of the graph clustering by about 15% and 80% for multi-level and spectral graph partitioning and clustering schemes, respectively.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126728438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}