{"title":"A survey of numerical linear algebra methods utilizing mixed-precision arithmetic","authors":"A. Abdelfattah, H. Anzt, E. Boman, E. Carson, T. Cojean, J. Dongarra, Alyson Fox, M. Gates, N. Higham, X. Li, J. Loe, P. Luszczek, S. Pranesh, S. Rajamanickam, T. Ribizel, Barry Smith, K. Swirydowicz, Stephen J. Thomas, S. Tomov, Y. Tsai, U. Yang","doi":"10.1177/10943420211003313","DOIUrl":"https://doi.org/10.1177/10943420211003313","url":null,"abstract":"The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to scientific computing applications. Especially with the hardware integration of low-precision special-function units designed for machine learning applications, the traditional numerical algorithms community urgently needs to reconsider the floating point formats used in the distinct operations to efficiently leverage the available compute power. In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra problems.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"344 - 369"},"PeriodicalIF":3.1,"publicationDate":"2021-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/10943420211003313","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47255580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-driven global weather predictions at high resolutions","authors":"John Taylor, P. Larraondo, B. D. de Supinski","doi":"10.1177/10943420211039818","DOIUrl":"https://doi.org/10.1177/10943420211039818","url":null,"abstract":"Society has benefited enormously from the continuous advancement in numerical weather prediction that has occurred over many decades driven by a combination of outstanding scientific, computational and technological breakthroughs. Here, we demonstrate that data-driven methods are now positioned to contribute to the next wave of major advances in atmospheric science. We show that data-driven models can predict important meteorological quantities of interest to society such as global high resolution precipitation fields (0.25°) and can deliver accurate forecasts of the future state of the atmosphere without prior knowledge of the laws of physics and chemistry. We also show how these data-driven methods can be scaled to run on supercomputers with up to 1024 modern graphics processing units and beyond resulting in rapid training of data-driven models, thus supporting a cycle of rapid research and innovation. Taken together, these two results illustrate the significant potential of data-driven methods to advance atmospheric science and operational weather forecasting.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"130 - 140"},"PeriodicalIF":3.1,"publicationDate":"2021-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46670592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerated execution via eager-release of dependencies in task-based workflows","authors":"Hatem Elshazly, F. Lordan, J. Ejarque, R. Badia","doi":"10.1177/1094342021997558","DOIUrl":"https://doi.org/10.1177/1094342021997558","url":null,"abstract":"Task-based programming models offer a flexible way to express the unstructured parallelism patterns of today's complex applications. This expressive capability is required to achieve maximum possible performance for applications that are executed on distributed execution platforms. In current task-based workflows, tasks are launched for execution when their data dependencies are satisfied. However, even though the data dependencies of a certain task might have been already produced, the execution of this task will be delayed until its predecessor tasks completely finish their execution. As a consequence of this approach of releasing dependencies, the amount of parallelism inherent in applications is limited and performance improvement opportunities are wasted. To mitigate this limitation, we propose an eager approach for releasing data dependencies. Following this approach, the execution of tasks will not be delayed until their predecessor tasks completely finish their execution; instead, tasks will be launched for execution as soon as their data requirements are available. Hence, more parallelism is exposed and applications can achieve higher levels of performance by overlapping the execution of tasks. Towards achieving this goal, in this paper we propose applying two changes to task-based workflow systems. First, modifying the dependency relationships of tasks to be specified not only in terms of predecessor and successor tasks but also in terms of the data that caused these dependencies. Second, triggering the release of dependencies as soon as a predecessor task generates the output data instead of having to wait until the end of the predecessor execution to release all of its dependencies. We realize this proposal using PyCOMPSs: a task-based programming model for parallelizing Python applications. Our experiments show that using an eager approach for releasing dependencies achieves more than 50% performance improvement in the total execution time as compared to the default approach of releasing dependencies.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"325 - 343"},"PeriodicalIF":3.1,"publicationDate":"2021-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1094342021997558","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44714622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-parallel in situ temporal compression of large-scale computational fluid dynamics data","authors":"Heather Pacella, Alec M. Dunton, A. Doostan, G. Iaccarino","doi":"10.1177/10943420221085000","DOIUrl":"https://doi.org/10.1177/10943420221085000","url":null,"abstract":"Present day computational fluid dynamics (CFD) simulations generate considerable amounts of data, sometimes on the order of TB/s. Often, a significant fraction of this data is discarded because current storage systems are unable to keep pace. To address this, data compression algorithms can be applied to data arrays containing flow quantities of interest (QoIs) to reduce the overall required storage. The matrix column interpolative decomposition (ID) can be implemented as a type of lossy compression for data matrices that factors the original data matrix into a product of two smaller factor matrices. One of these matrices consists of a subset of the columns of the original data matrix, while the other is a coefficient matrix which approximates the original data matrix columns as linear combinations of the selected columns. Motivating this work is the observation that the structure of ID algorithms makes them well suited for the asynchronous nature of task-based parallelism; they can operate independently on subdomains of the system of interest and, as a result, provide varied levels of compression. Using the task-based Legion programming model, a single-pass ID algorithm (SPID) for CFD applications is implemented. Performance studies, scalability, and the accuracy of the compression algorithm are presented for a benchmark analytical Taylor-Green vortex problem, as well as large-scale implementations of both low and high Reynolds number (Re) compressible Taylor-Green vortices using a high-order Navier-Stokes solver. In the case of the analytical solution, the resulting compressed solution was rank-one, with error on the order of machine precision. For the low-Re vortex, compression factors between 1000 and 10,000 were achieved for errors in the range 10^-2 to 10^-3. Similar error values were seen for the high-Re vortex, this time with compression factors between 100 and 1000. Moreover, strong and weak scaling results demonstrate that introducing SPID to solvers leads to negligible increases in runtime.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"388 - 418"},"PeriodicalIF":3.1,"publicationDate":"2021-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43904224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction","authors":"Tommaso Benacchio, Luca Bonaventura, Mirco Altenbernd, C. Cantwell, P. Düben, M. Gillard, L. Giraud, Dominik Göddeke, E. Raffin, K. Teranishi, N. Wedi","doi":"10.1177/1094342021990433","DOIUrl":"https://doi.org/10.1177/1094342021990433","url":null,"abstract":"Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"285 - 311"},"PeriodicalIF":3.1,"publicationDate":"2021-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1094342021990433","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42265908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021, Proceedings","authors":"","doi":"10.1007/978-3-030-78713-4","DOIUrl":"https://doi.org/10.1007/978-3-030-78713-4","url":null,"abstract":"","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"17 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91045977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Computing: 7th Latin American Conference, CARLA 2020, Cuenca, Ecuador, September 2–4, 2020, Revised Selected Papers","authors":"","doi":"10.1007/978-3-030-68035-0","DOIUrl":"https://doi.org/10.1007/978-3-030-68035-0","url":null,"abstract":"","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"46 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88440220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Point-block incomplete LU preconditioning with asynchronous iterations on GPU for multiphysics problems","authors":"Wenpeng Ma, X. Cai","doi":"10.1177/1094342020981153","DOIUrl":"https://doi.org/10.1177/1094342020981153","url":null,"abstract":"Point-block matrices arise naturally in multiphysics problems when all variables associated with a mesh point are ordered together, and are different from the general block matrices since the sizes of the blocks are so small that one can often invert some of the diagonal blocks explicitly. Motivated by the recent works of Chow and Patel and Chow et al., we propose an efficient incomplete LU (ILU) preconditioner for point-block matrices targeting applications on GPU. The construction of the preconditioner involves two critical steps: (1) the initial guessing of values for the lower and upper triangular matrices; and (2) several sweeps of asynchronous updating of the triangular matrices. Three representative problems are studied to show the advantage of the proposed point-block approach over the standard point-wise approach in terms of the number of GMRES iterations and also the total compute time. Moreover, we compare the proposed algorithm with the level-scheduling based parallel algorithm employed in NVIDIA’s cuSPARSE library as well as the serial method implemented in the Intel MKL library, and the experiments show that a 2×–5× speedup can be achieved over the block-based ILU(p) factorizations from the cuSPARSE library.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"121 - 135"},"PeriodicalIF":3.1,"publicationDate":"2020-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1094342020981153","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44174581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Highly efficient lattice Boltzmann multiphase simulations of immiscible fluids at high-density ratios on CPUs and GPUs through code generation","authors":"M. Holzer, Martin Bauer, H. Köstler, U. Rüde","doi":"10.1177/10943420211016525","DOIUrl":"https://doi.org/10.1177/10943420211016525","url":null,"abstract":"A high-performance implementation of a multiphase lattice Boltzmann method based on the conservative Allen-Cahn model supporting high-density ratios and high Reynolds numbers is presented. Meta-programming techniques are used to generate optimized code for CPUs and GPUs automatically. The coupled model is specified in a high-level symbolic description and optimized through automatic transformations. The memory footprint of the resulting algorithm is reduced through the fusion of compute kernels. A roofline analysis demonstrates the excellent efficiency of the generated code on a single GPU. The resulting single GPU code has been integrated into the multiphysics framework waLBerla to run massively parallel simulations on large domains. Communication hiding and GPUDirect-enabled MPI yield near-perfect scaling behavior. Scaling experiments are conducted on the Piz Daint supercomputer with up to 2048 GPUs, simulating several hundred fully resolved bubbles. Further, validation of the implementation is shown in a physically relevant scenario: a three-dimensional rising air bubble in water.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"413 - 427"},"PeriodicalIF":3.1,"publicationDate":"2020-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/10943420211016525","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43072628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fine-grained parallelization of the immersed boundary method","authors":"A. Kassen, Varun Shankar, A. Fogelson","doi":"10.1177/10943420221083572","DOIUrl":"https://doi.org/10.1177/10943420221083572","url":null,"abstract":"We present new algorithms for the parallelization of Eulerian–Lagrangian interaction operations in the immersed boundary method. Our algorithms rely on two well-studied parallel primitives: key-value sort and segmented reduce. The use of these parallel primitives allows us to implement our algorithms on both graphics processing units (GPUs) and on other shared-memory architectures. We present strong and weak scaling tests on problems involving scattered points and elastic structures. Our tests show that our algorithms exhibit near-ideal scaling on both multicore CPUs and GPUs.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"443 - 458"},"PeriodicalIF":3.1,"publicationDate":"2020-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48107147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}