{"title":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","authors":"V. Alexandrov, A. Geist, J. Dongarra","doi":"10.1145/3148226","DOIUrl":"https://doi.org/10.1145/3148226","url":null,"abstract":"Novel scalable scientific algorithms are needed in order to enable key science applications to exploit the computational power of large-scale systems. This is especially true for the current tier of leading petascale machines and the road to exascale computing as HPC systems continue to scale up in compute node and processor core count. These extreme-scale systems require novel scientific algorithms to hide network and memory latency, have very high computation/communication overlap, have minimal communication, and have no synchronization points. With the advent of Big Data in the past few years the need of such scalable mathematical methods and algorithms able to handle data and compute intensive applications at scale becomes even more important. \u0000 \u0000Scientific algorithms for multi-petaflop and exa-flop systems also need to be fault tolerant and fault resilient, since the probability of faults increases with scale. Resilience at the system software and at the algorithmic level is needed as a crosscutting effort. Finally, with the advent of heterogeneous compute nodes that employ standard processors as well as GPGPUs, scientific algorithms need to match these architectures to extract the most performance. This includes different system-specific levels of parallelism as well as co-scheduling of computation. Key science applications require novel mathematics and mathematical models and system software that address the scalability and resilience challenges of current- and future-generation extreme-scale HPC systems. \u0000 \u0000The goal of this workshop is to bring together experts in the area of scalable algorithms to present the latest achievements and to discuss the challenges ahead.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116403653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional eulerian code on many core platforms","authors":"Y. Idomura, Takuya Ina, Akie Mayumi, S. Yamada, Kazuya Matsumoto, Y. Asahi, Toshiyuki Imamura","doi":"10.1145/3148226.3148234","DOIUrl":"https://doi.org/10.1145/3148226.3148234","url":null,"abstract":"A communication-avoiding generalized minimal residual (CA-GMRES) method is applied to the gyrokinetic toroidal five dimensional Eulerian code GT5D, and its performance is compared against the original code with a generalized conjugate residual (GCR) method on the JAEA ICEX (Haswell), the Plasma Simulator (FX100), and the Oakforest-PACS (KNL). Although the CA-GMRES method dramatically reduces the number of data reduction communications, computation is largely increased compared with the GCR method. To resolve this issue, we propose a modified CA-GMRES method, which reduces both computation and memory access by ~ 30% with keeping the same CA property as the original CA-GMRES method. The modified CA-GMRES method has ~ 3.8X higher arithmetic intensity than the GCR method, and thus, is suitable for future Exa-scale architectures with limited memory and network bandwidths. The CA-GMRES solver is implemented using a hybrid CA approach, in which we apply CA to data reduction communications and use communication overlap for halo data communications, and is highly optimized for distributed caches on KNL. It is shown that compared with the GCR solver, its computing kernels are accelerated by 1.47X ~ 2.39X, and the cost of data reduction communication is reduced from 5% ~ 13% to ~ 1% of the total cost at 1,280 nodes.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131096460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible batched sparse matrix-vector product on GPUs","authors":"H. Anzt, Gary Collins, J. Dongarra, Goran Flegar, E. S. Quintana‐Ortí","doi":"10.1145/3148226.3148230","DOIUrl":"https://doi.org/10.1145/3148226.3148230","url":null,"abstract":"We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices which differ in size, nonzero count, and nonzero distribution. Furthermore, they support three most commonly used sparse storage formats: CSR, COO and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25X compared to non-batched SpMV routines.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127895857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating half precision arithmetic to accelerate dense linear system solvers","authors":"A. Haidar, Panruo Wu, S. Tomov, J. Dongarra","doi":"10.1145/3148226.3148237","DOIUrl":"https://doi.org/10.1145/3148226.3148237","url":null,"abstract":"The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today's powerful manycore GPU accelerators, e.g., like the NVIDIA V100, that can provide 120 TeraFLOPS alone in FP16. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique - we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly-tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for a first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance, and limitations of the approach.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114975116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging NVLINK and asynchronous data transfer to scale beyond the memory capacity of GPUs","authors":"D. Appelhans, B. Walkup","doi":"10.1145/3148226.3148232","DOIUrl":"https://doi.org/10.1145/3148226.3148232","url":null,"abstract":"In this paper we demonstrate the utility of fast GPU to CPU interconnects to weak scale on hierarchical nodes without being limited to problem sizes that fit only in the GPU memory capacity. We show the speedup possible for a new regime of algorithms which traditionally have not benefited from being ported to GPUs because of an insufficient amount of computational work relative to bytes of data that must be transferred (offload intensity). This new capability is demonstrated with an example of our hierarchical GPU port of UMT, the 51K line CORAL benchmark application for Lawrence Livermore National Lab's radiation transport code. By overlapping data transfers and using the NVLINK connection between IBM POWER 8 CPUs and NVIDIA P100 GPUs, we demonstrate a speedup that continues even when scaling the problem size well beyond the memory capacity of the GPUs. Scaling to large local domains per MPI process is a necessary step to solving very large problems, and in the case of UMT, large local domains improve the convergence as the number of MPI ranks are weak scaled.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132630986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing the criticality of transient faults-induced SDCS on GPU applications","authors":"F. Santos, P. Rech","doi":"10.1145/3148226.3148228","DOIUrl":"https://doi.org/10.1145/3148226.3148228","url":null,"abstract":"In this paper we compare the soft-error sensitivity of parallel applications on modern Graphics Processing Units (GPUs) obtained through architectural-level fault injections and high-energy particle beam radiation experiments. Fault-injection and beam experiments provide different information and uses different transient-fault sensitivity metrics, which are hard to combine. In this paper we show how correlating beam and fault-injection data can provide a deeper understanding of the behavior of GPUs in the occurrence of transient faults. In particular, we demonstrate that commonly used architecture-level fault models (and fast injection tools) can be used to identify critical kernels and to associate some experimentally observed output errors with their causes. Additionally, we show how register file and instruction-level injections can be used to evaluate ECC efficiency in reducing the radiation-induced error rate.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123652656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations","authors":"M. Obersteiner, A. Parra-Hinojosa, M. Heene, H. Bungartz, D. Pflüger","doi":"10.1145/3148226.3148229","DOIUrl":"https://doi.org/10.1145/3148226.3148229","url":null,"abstract":"With future exascale computers expected to have millions of compute units distributed among thousands of nodes, system faults are predicted to become more frequent. Fault tolerance will thus play a key role in HPC at this scale. In this paper we focus on solving the 5-dimensional gyrokinetic Vlasov-Maxwell equations using the application code GENE as it represents a high-dimensional and resource-intensive problem which is a natural candidate for exascale computing. We discuss the Fault-Tolerant Combination Technique, a resilient version of the Combination Technique, a method to increase the discretization resolution of existing PDE solvers. For the first time, we present an efficient, scalable and fault-tolerant implementation of this algorithm for plasma physics simulations based on a manager-worker model and test it under very realistic and pessimistic environments with simulated faults. We show that the Fault-Tolerant Combination Technique - an algorithm-based forward recovery method - can tolerate a large number of faults with a low overhead and at an acceptable loss in accuracy. Our parallel experiments with up to 32k cores show good scalability at a relative parallel efficiency of 93.61%. We conclude that algorithm-based solutions to fault tolerance are attractive for this type of problems.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129451962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic load balancing of massively parallel unstructured meshes","authors":"Gerrett Diamond, Cameron W. Smith, M. Shephard","doi":"10.1145/3148226.3148236","DOIUrl":"https://doi.org/10.1145/3148226.3148236","url":null,"abstract":"Simulating systems with evolving relational structures on massively parallel computers require the computational work to be evenly distributed across the processing resources throughout the simulation. Adaptive, unstructured, mesh-based finite element and finite volume tools best exemplify this need. We present EnGPar and its diffusive partition improvement method that accounts for multiple application specified criteria. EnGPar's performance is compared against its predecessor, ParMA. Specifically, partition improvement results are provided on up to 512Ki processes of the Argonne Leadership Computing Facility's Mira BlueGene/Q system.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115777927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic task discovery in PaRSEC: a data-flow task-based runtime","authors":"Reazul Hoque, T. Hérault, G. Bosilca, J. Dongarra","doi":"10.1145/3148226.3148233","DOIUrl":"https://doi.org/10.1145/3148226.3148233","url":null,"abstract":"Successfully exploiting distributed collections of heterogeneous many-cores architectures with complex memory hierarchy through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intra-node features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute to both local and distributed data dependencies management. The solution explored in this paper, PaRSEC, is based on such a programming paradigm, supported by a highly efficient task-based runtime. This paper compares two programming paradigms present in PaRSEC, Parameterized Task Graph (PTG) and Dynamic Task Discovery (DTD) in terms of capabilities, overhead and potential benefits.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130413269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel jaccard and related graph clustering techniques","authors":"Alexandre Fender, N. Emad, S. Petiton, Joe Eaton, M. Naumov","doi":"10.1145/3148226.3148231","DOIUrl":"https://doi.org/10.1145/3148226.3148231","url":null,"abstract":"In this paper we propose to generalize Jaccard and related measures, often used as similarity coefficients between two sets. We define Jaccard, Dice-Sorensen and Tversky edge weights on a graph and generalize them to account for vertex weights. We develop an efficient parallel algorithm for computing Jaccard edge and PageRank vertex weights. We highlight that the weights computation can obtain more than 10X speedup on the GPU versus CPU on large realistic data sets. Also, we show that finding a minimum balanced cut for modified weights can be related to minimizing the sum of ratios of the intersection and union of nodes on the boundary of clusters. Finally, we show that the novel weights can improve the quality of the graph clustering by about 15% and 80% for multi-level and spectral graph partitioning and clustering schemes, respectively.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126728438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}