S. Tseng, Bogdan Nicolae, F. Cappello, Aparna Chandramowlishwaran. "Demystifying asynchronous I/O interference in HPC applications." International Journal of High Performance Computing Applications, vol. 35, pp. 391–412 (published online May 13, 2021). DOI: 10.1177/10943420211016511.
Abstract: With the increasing complexity of HPC workflows, data management services need to perform expensive I/O operations asynchronously in the background, aiming to overlap the I/O with the application runtime. However, this may cause interference due to competition for resources: CPU cores, memory bandwidth, and network bandwidth. The advent of multi-core architectures has exacerbated this problem, as many I/O operations are issued concurrently and therefore compete not only with the application but also among themselves. Furthermore, the interference patterns can change dynamically in response to variations in application behavior and in the I/O subsystem (e.g., multiple users sharing a parallel file system). Without a thorough understanding of these effects, asynchronous I/O operations may perform suboptimally, potentially even worse than in the blocking case. To fill this gap, this paper investigates the causes and consequences of interference due to asynchronous I/O on HPC systems. Specifically, we focus on multi-core CPUs and memory bandwidth, isolating the interference due to each resource. We then perform an in-depth study to explain the interplay and contention in a variety of resource-sharing scenarios, such as varying the priority and number of background I/O threads and different I/O strategies (sendfile, read/write, mmap/write), highlighting their trade-offs. The insights from this study are important both for guided optimization of existing background I/O and for opening new opportunities to design advanced asynchronous I/O strategies.

S. A. Jacobs, Tim Moon, K. McLoughlin, Derek Jones, D. Hysom, Dong H. Ahn, J. Gyllenhaal, Pythagoras Watson, F. Lightstone, Jonathan E. Allen, I. Karlin, B. V. Van Essen. "Enabling rapid COVID-19 small molecule drug design through scalable deep learning of generative models." International Journal of High Performance Computing Applications, vol. 35, pp. 469–482 (published online May 3, 2021). DOI: 10.1177/10943420211010930.
Abstract: We improved the quality of, and reduced the time to produce, machine-learned models for use in small-molecule antiviral design. Our globally asynchronous, multi-level parallel training approach strong-scales to all of Sierra with up to 97.7% efficiency. We trained a novel character-based Wasserstein autoencoder that produces a higher-quality model, trained on 1.613 billion compounds in 23 minutes, whereas the previous state of the art takes a day on 1 million compounds. Reducing training time from a day to minutes shifts the model-creation bottleneck from computer job turnaround time to human innovation time. Our implementation achieves 318 PFLOPS, 17.1% of half-precision peak. We will incorporate this model into our molecular design loop, enabling the generation of more diverse compounds, improving the search for novel candidate antiviral drugs, and reducing the time to synthesize compounds to be tested in the lab.

{"title":"Selecting optimal SpMV realizations for GPUs via machine learning","authors":"E. Dufrechou, P. Ezzatti, E. S. Quintana‐Ortí","doi":"10.1177/1094342021990738","DOIUrl":"https://doi.org/10.1177/1094342021990738","url":null,"abstract":"More than 10 years of research related to the development of efficient GPU routines for the sparse matrix-vector product (SpMV) have led to several realizations, each with its own strengths and weaknesses. In this work, we review some of the most relevant efforts on the subject, evaluate a few prominent routines that are publicly available using more than 3000 matrices from different applications, and apply machine learning techniques to anticipate which SpMV realization will perform best for each sparse matrix on a given parallel platform. Our numerical experiments confirm the methods offer such varied behaviors depending on the matrix structure that the identification of general rules to select the optimal method for a given matrix becomes extremely difficult, though some useful strategies (heuristics) can be defined. Using a machine learning approach, we show that it is possible to obtain unexpensive classifiers that predict the best method for a given sparse matrix with over 80% accuracy, demonstrating that this approach can deliver important reductions in both execution time and energy consumption.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"254 - 267"},"PeriodicalIF":3.1,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1094342021990738","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47422492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue related to the Power-Aware Computing Workshop 2019—PACO 2019","authors":"P. Benner, E. S. Quintana‐Ortí, J. Saak","doi":"10.1177/10943420211008791","DOIUrl":"https://doi.org/10.1177/10943420211008791","url":null,"abstract":"Power-awareness in high-performance scientific computing has gained increased interest due to its non-negligible contributions to carbon-dioxide emissions and thus, to one of the main drivers of anthropocenic climate change. This is for instance recognized and popularized by the Green500 list, which ranks the supercomputers from the TOP500 list in terms of energy efficiency by measuring performance per Watt. In a joint project, funded 2015–2016 by the German Ministry of Education and Research (BMBF), the Max Planck Institute for Dynamics of Complex Technical Systems in Magdeburg (Germany) and the Universidad de la República in Montevideo (Uruguay) investigated numerical linear algebra algorithms for applications in systems and control theory with respect to power consumption and energy efficiency. As part of this effort, a first workshop on “Power-Aware Computing (PACO 2015)” was held in Magdeburg, July 6–7, 2015. The follow-up workshop PACO 2017 took place July 5–8, 2017, at Ringberg Castle in the south of Bavaria (Germany). PACO 2019, held November 5–6, 2019, again in Magdeburg, was the third instance in this series of workshops, and this special issue is dedicated to research results presented at this workshop. The aims and scope of the PACO workshops comprise developments in power or energy savings in computational systems. The interests include, but are not strictly limited to:","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"209 - 210"},"PeriodicalIF":3.1,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/10943420211008791","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47328749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N. Kondratyuk, V. Nikolskiy, D. Pavlov, V. Stegailov. "GPU-accelerated molecular dynamics: State-of-art software performance and porting from Nvidia CUDA to AMD HIP." International Journal of High Performance Computing Applications, vol. 35, pp. 312–324 (published online April 19, 2021). DOI: 10.1177/10943420211008288.
Abstract: Classical molecular dynamics (MD) calculations account for a significant share of the utilization of high-performance computing systems. The efficiency of such calculations rests on an interplay of software and hardware that is nowadays moving toward hybrid GPU-based technologies. Several well-developed open-source MD codes focused on GPUs differ both in their data management capabilities and in performance. In this work, we analyze the performance of the LAMMPS, GROMACS, and OpenMM MD packages with different GPU backends on Nvidia Volta and AMD Vega20 GPUs. We consider the efficiency of solving two identical MD models (one generic for materials science and one for biomolecular studies) using different software and hardware combinations. We also describe our experience in porting the CUDA backend of LAMMPS to ROCm HIP, which shows considerable benefits for AMD GPUs compared to the OpenCL backend.

Jordan Musser, A. Almgren, W. Fullmer, O. Antepara, J. Bell, Johannes Blaschke, K. Gott, A. Myers, R. Porcù, Deepak Rangarajan, M. Rosso, Weiqun Zhang, M. Syamlal. "MFIX-Exa: A path toward exascale CFD-DEM simulations." International Journal of High Performance Computing Applications, vol. 36, pp. 40–58 (published online April 16, 2021). DOI: 10.1177/10943420211009293.
Abstract: MFIX-Exa is a computational fluid dynamics–discrete element model (CFD-DEM) code designed to run efficiently on current and next-generation supercomputing architectures. MFIX-Exa combines the CFD-DEM expertise embodied in the MFIX code—which was developed at NETL and is used widely in academia and industry—with the modern software framework AMReX, developed at LBNL. The fundamental physics models follow those of the original MFIX, but the combination of new algorithmic approaches and a new software infrastructure will enable MFIX-Exa to leverage future exascale machines to optimize the modeling and design of multiphase chemical reactors.

{"title":"A runtime based comparison of highly tuned lattice Boltzmann and finite difference solvers","authors":"K. Wichmann, M. Kronbichler, R. Löhner, W. Wall","doi":"10.1177/10943420211006169","DOIUrl":"https://doi.org/10.1177/10943420211006169","url":null,"abstract":"The aim of this work is a fair and unbiased comparison of a lattice Boltzmann method (LBM) against a finite difference method (FDM) for the simulation of fluid flows. Rather than reporting metrics such as floating point operation rates or memory throughput, our work considers the engineering quest of reaching a desired solution quality with the least computational effort. The specific lattice Boltzmann and finite difference methods selected here are of a very basic nature to emphasize the influence of the fundamentally different approaches. To minimize the skew in the measurements, complex boundary condition schemes and further advanced techniques are avoided and instead both methods are fully explicit, weakly compressible approaches. Due to the highly optimized nature of both codes, different sets of restrictions are imposed by either method. Using the common set of features, two relatively simple test cases in terms of a duct flow and the flow in a lid driven cavity are considered and are tuned to perform optimally with both approaches. As a third test case, a transient flow around a square cylinder is used to demonstrate the applicability to engineering oriented settings and in a temporal domain. The performance of the two methods is found to be very similar with no full advantage for any of the approaches. Overall a tendency toward better performance of the LBM at larger target errors and for indirect benchmark quantities, such as lift and drag, is observed, while the FDM excels at smaller target errors and direct comparisons of velocity and pressure profiles to analytical solutions. Other factors such as the difficulty of setting consistent boundary conditions in the LBM or the effect of stabilization in the FDM are likely to be the most important criteria when searching for a very fast flow solver for practical applications.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"370 - 390"},"PeriodicalIF":3.1,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/10943420211006169","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41854300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Dünnebacke, S. Turek, C. Lohmann, A. Sokolov, P. Zajác. "Increased space-parallelism via time-simultaneous Newton-multigrid methods for nonstationary nonlinear PDE problems." International Journal of High Performance Computing Applications, vol. 35, pp. 211–225 (published online April 1, 2021). DOI: 10.1177/10943420211001940.
Abstract: We discuss how "parallel-in-space & simultaneous-in-time" Newton-multigrid approaches can be designed to improve the scaling behavior of spatial parallelism by reducing latency costs. The idea is to solve many time steps at once, thereby solving fewer but larger systems. These large systems are reordered and interpreted as a space-only problem, leading to a multigrid algorithm with semi-coarsening in space and line smoothing in the time direction. The smoother is further improved by embedding it as a preconditioner in a Krylov subspace method. As a prototypical application, we concentrate on scalar partial differential equations (PDEs) with up to many thousands of time steps, discretized in time by finite differences and in space by finite elements. For linear PDEs, the resulting method is closely related to multigrid waveform relaxation and its theoretical framework. In our parabolic test problems, the numerical behavior of this multigrid approach is robust with respect to the spatial and temporal grid sizes and the number of simultaneously treated time steps. Moreover, we illustrate how corresponding time-simultaneous fixed-point and Newton-type solvers can be derived for nonlinear nonstationary problems, which require the described solution of linearized problems in each outer nonlinear step. As the main result, we are able to generate much larger problem sizes to be treated by a large number of cores, so that the combination of robustly scaling multigrid solvers with a larger degree of parallelism allows a faster solution procedure for nonstationary problems.

J. Glaser, J. Vermaas, D. Rogers, J. Larkin, S. Legrand, Swen Boehm, Matthew B. Baker, A. Scheinberg, A. F. Tillack, M. Thavappiragasam, A. Sedova, Oscar R. Hernandez. "High-throughput virtual laboratory for drug discovery using massive datasets." International Journal of High Performance Computing Applications, vol. 35, pp. 452–468 (published online March 23, 2021). DOI: 10.1177/10943420211001565.
Abstract: Time-to-solution for structure-based screening of massive chemical databases for COVID-19 drug discovery has been decreased by an order of magnitude, and a virtual laboratory has been deployed at scale on up to 27,612 GPUs on the Summit supercomputer, allowing molecular docking of an average of 19,028 compounds per second. Over one billion compounds were docked to two SARS-CoV-2 protein structures with full optimization of ligand position and 20 poses per docking, each in under 24 hours. GPU acceleration and high-throughput optimizations of the docking program produced a 350× mean speedup over the CPU version (50× speedup per node). GPU acceleration of both feature calculation for machine-learning-based scoring and distributed database queries reduced processing of the 2.4 TB output by orders of magnitude. The resulting 50× speedup for the full pipeline reduces an initial 43-day runtime to 21 hours per protein for providing high-scoring compounds to experimental collaborators for validation assays.

{"title":"A massively parallel time-domain coupled electrodynamics–micromagnetics solver","authors":"Z. Yao, R. Jambunathan, Yadong Zeng, A. Nonaka","doi":"10.1177/10943420211057906","DOIUrl":"https://doi.org/10.1177/10943420211057906","url":null,"abstract":"We present a high-performance coupled electrodynamics–micromagnetics solver for full physical modeling of signals in microelectronic circuitry. The overall strategy couples a finite-difference time-domain approach for Maxwell’s equations to a magnetization model described by the Landau–Lifshitz–Gilbert equation. The algorithm is implemented in the Exascale Computing Project software framework, AMReX, which provides effective scalability on manycore and GPU-based supercomputing architectures. Furthermore, the code leverages ongoing developments of the Exascale Application Code, WarpX, which is primarily being developed for plasma wakefield accelerator modeling. Our temporal coupling scheme provides second-order accuracy in space and time by combining the integration steps for the magnetic field and magnetization into an iterative sub-step that includes a trapezoidal temporal discretization for the magnetization. The performance of the algorithm is demonstrated by the excellent scaling results on NERSC multicore and GPU systems, with a significant (59×) speedup on the GPU using a node-by-node comparison. We demonstrate the utility of our code by performing simulations of an electromagnetic waveguide and a magnetically tunable filter.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"167 - 181"},"PeriodicalIF":3.1,"publicationDate":"2021-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43687761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}