{"title":"Graph Contractions for Calculating Correlation Functions in Lattice QCD","authors":"Jing Chen, R. Edwards, W. Mao","doi":"10.1145/3592979.3593409","DOIUrl":"https://doi.org/10.1145/3592979.3593409","url":null,"abstract":"Computing correlation functions for many-particle systems in Lattice QCD is vital to extract nuclear physics observables like the energy spectrum of hadrons such as protons. However, this type of calculation has long been considered to be very challenging and computing-resource intensive because of the complex nature of a hadron composed of quarks with many degrees of freedom. In particular, a correlation function can be calculated through a sum of all possible pairs of quark contractions, each of which is a batched tensor contraction, dictated by Wick's theorem. Because the number of terms of this sum can be very large for any hadronic system of interest, fast evaluation of the sum faces several challenges: an extremely large number of contractions, a huge memory footprint at runtime, and the speed of tensor contractions. In this paper, we present a Lattice QCD analysis software suite, Redstar, which addresses these challenges by utilizing novel algorithmic and software engineering methods targeting modern computing platforms such as many-core CPUs and GPUs. In particular, Redstar represents every term in the sum of a correlation function by a graph, applies efficient graph algorithms to reduce the number of contractions to lower the cost of computations, and minimizes the total memory footprint. 
Moreover, Redstar carries out the contractions on either CPUs or GPUs utilizing an internal, highly efficient contraction library, Hadron. Specifically, we illustrate some important algorithmic optimizations of Redstar, show various key design features of the Hadron library, and present the speedups due to the optimizations, along with performance figures for calculating six correlation functions on four computing platforms.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122040303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
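The cost-reduction idea in this abstract (each Wick contraction term is a graph of quark propagator lines, and common sub-contractions are shared) can be sketched in a few lines. This is an illustrative toy, not the Redstar implementation; the `propagator` callable and the edge-level caching granularity are assumptions for the sketch.

```python
# Toy sketch of graph-based Wick contraction sharing (NOT the Redstar code).
# Each term in the correlation function is a pairing of quarks with
# antiquarks; each pairing is a small graph whose edges are propagator
# lines.  Caching each distinct edge mimics how a graph representation
# exposes common sub-contractions so they are evaluated only once.
from itertools import permutations

def correlation(n_quarks, propagator):
    """Sum over all Wick pairings, evaluating each distinct edge once."""
    cache = {}                        # edge -> evaluated propagator element
    total = 0.0
    for pairing in permutations(range(n_quarks)):
        term = 1.0
        for q, qbar in enumerate(pairing):
            edge = (q, qbar)
            if edge not in cache:     # shared sub-contraction: compute once
                cache[edge] = propagator(q, qbar)
            term *= cache[edge]
        total += term
    return total, len(cache)

# 3 quarks -> 3! = 6 pairings touching 18 edges, but only 9 distinct ones
total, n_evaluated = correlation(3, lambda q, qbar: 1.0)
```

With a real propagator each edge would be a batched tensor contraction, so evaluating 9 instead of 18 is exactly the kind of saving a graph representation buys.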
{"title":"Approximation and Optimization of Global Environmental Simulations with Neural Networks","authors":"E. Azmi, J. Meyer, M. Strobl, M. Weimer, Achim Streit","doi":"10.1145/3592979.3593418","DOIUrl":"https://doi.org/10.1145/3592979.3593418","url":null,"abstract":"Solving a system of hundreds of chemical differential equations in environmental simulations incurs major computational complexity and thereby requires high-performance computing resources, a challenge that grows as the spatio-temporal resolution increases. Machine learning methods, especially deep learning, can offer an approximation of simulations with some factor of speed-up while using fewer compute resources. In this work, we introduce a neural-network-based approach (ICONET) to forecast trace gas concentrations without executing the traditional compute-intensive atmospheric simulations. ICONET is equipped with a multifeature Long Short-Term Memory (LSTM) model to forecast atmospheric chemicals iteratively in time. We generated the training and test datasets, our target dataset for ICONET, by executing an atmospheric chemistry simulation in ICON-ART. Applying the trained ICONET model to forecast a test dataset results in a good fit of the forecast values to our target dataset. We discuss appropriate metrics to evaluate the quality of models and present the quality of the ICONET forecasts with the RMSE and KGE metrics. The varied nature of trace gases limits the model's learning and forecast skill differently for each respective trace gas. In addition to the quality of the ICONET forecasts, we describe the computational efficiency of ICONET as its run-time speed-up in comparison to the run time of the ICON-ART simulation.
The ICONET forecast showed a speed-up factor of 3.1 over the run time of the atmospheric chemistry simulation of ICON-ART, which is a significant achievement, especially when considering the importance of ensemble simulation.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"418 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132000032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
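The iterative-in-time forecasting described above (a one-step model applied autoregressively) can be sketched generically. The `rollout` helper and the decay "model" are hypothetical stand-ins, not the ICONET LSTM.

```python
# Minimal sketch of autoregressive rollout (names hypothetical, not the
# ICONET code): a trained one-step forecaster is applied iteratively, each
# predicted state of trace-gas concentrations becoming the next input.
def rollout(step_model, state, n_steps):
    """Roll a one-step forecaster forward n_steps in time."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step_model(state)   # prediction fed back as the next input
        trajectory.append(state)
    return trajectory

# toy stand-in for the learned model: exponential decay of a concentration
traj = rollout(lambda c: 0.9 * c, 1.0, 3)
```

The speed-up over ICON-ART comes precisely from this loop: each `step_model` call replaces a full chemistry solve.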
{"title":"Scaling Resolution of Gigapixel Whole Slide Images Using Spatial Decomposition on Convolutional Neural Networks","authors":"A. Tsaris, Josh Romero, T. Kurth, Jacob Hinkle, Hong-Jun Yoon, Feiyi Wang, Sajal Dash, G. Tourassi","doi":"10.1145/3592979.3593401","DOIUrl":"https://doi.org/10.1145/3592979.3593401","url":null,"abstract":"Gigapixel images are prevalent in scientific domains ranging from remote sensing and satellite imagery to microscopy. However, training a deep learning model at the natural resolution of those images has been a challenge, both in overcoming resource limits (e.g., HBM memory constraints) and in scaling up to a large number of GPUs. In this paper, we trained Residual neural Networks (ResNet) on 22,528 x 22,528-pixel images using a distributed spatial decomposition method on 2,304 GPUs on the Summit supercomputer. We applied our method to a Whole Slide Imaging (WSI) dataset from The Cancer Genome Atlas (TCGA) database. WSI images can be 100,000 x 100,000 pixels or even larger, and in this work we studied the effect of image resolution on a classification task while achieving state-of-the-art AUC scores. Moreover, our approach does not need pixel-level labels, since it avoids patching the WSI images entirely, while adding the capability of training on arbitrarily large images. This is achieved through a distributed spatial decomposition method that leverages the non-blocking fat-tree interconnect network of the Summit architecture, which enabled direct GPU-to-GPU communication.
Finally, detailed performance analysis results are shown, as well as a comparison with a data-parallel approach when possible.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128881341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
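The key mechanism above, spatial decomposition with halo exchange so that a stencil-like operator computed in pieces matches the undecomposed result, can be illustrated in one dimension. This is a serial toy with hypothetical function names, not the distributed GPU implementation.

```python
# 1-D analogue of spatial decomposition (illustrative, not the paper's code):
# each "rank" owns a contiguous chunk and needs one-pixel halos from its
# neighbours to apply a 3-point averaging stencil, just as a convolution
# over a decomposed image needs boundary rows from neighbouring GPUs.
def blur_full(x):
    """Reference: 3-point moving average over the whole signal, zero boundary."""
    ext = [0.0] + list(x) + [0.0]
    return [(ext[i - 1] + ext[i] + ext[i + 1]) / 3.0 for i in range(1, len(x) + 1)]

def blur_decomposed(x, n_parts):
    """Same stencil, computed chunk by chunk with explicit halo exchange."""
    n = len(x)
    assert n % n_parts == 0
    size = n // n_parts
    out = []
    for p in range(n_parts):
        lo, hi = p * size, (p + 1) * size
        left = x[lo - 1] if lo > 0 else 0.0    # halo from left neighbour
        right = x[hi] if hi < n else 0.0       # halo from right neighbour
        local = [left] + x[lo:hi] + [right]
        out += [(local[i - 1] + local[i] + local[i + 1]) / 3.0
                for i in range(1, size + 1)]
    return out

x = [float(v) for v in range(1, 9)]
full = blur_full(x)
dec = blur_decomposed(x, 4)
```

Because only halos cross partition boundaries, the decomposed result is identical to the full-resolution computation, which is what makes training without patching possible.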
{"title":"StyleGAN as a Deconvolutional Operator for Large Eddy Simulations","authors":"J. Castagna, F. Schiavello","doi":"10.1145/3592979.3593404","DOIUrl":"https://doi.org/10.1145/3592979.3593404","url":null,"abstract":"We present a novel deconvolution operator for Large Eddy Simulation (LES) of turbulent flows based on the latest StyleGAN deep learning networks. We exploit the flexibility of this architecture in separating the different layers of the GAN generator, which can be seen as instantaneous fields of the LES. These can be advanced in time by integrating the corresponding filtered Navier-Stokes (NS) equations. The subgrid-scale (SGS) stress tensor is obtained from the reconstructed field rather than from ad-hoc turbulence models. We trained a StyleGAN-based network (MSG-StyleGAN) with 5000 images of a decaying 2D-Homogeneous Isotropic Turbulence (2D-HIT) flow starting at ReΛ = 60 using a 256x256 grid mesh size. We then reconstructed a DNS simulation, point by point, using a 32x32 resolution via a search in the latent space of the GAN until the difference between the internal fields and the LES fields is within a given tolerance. Results show convergence towards the ground-truth DNS solution as the tolerance approaches zero.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":" 24","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133021079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
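The latent-space search described above (adjust the latent code until the generated field matches the observed field within a tolerance) can be sketched with a scalar toy. Everything here is hypothetical: a 1-D "generator" and a finite-difference descent stand in for the paper's MSG-StyleGAN and its actual search procedure.

```python
# Toy sketch of latent-space search to a tolerance (NOT the paper's method):
# minimise the squared mismatch between generator output and a target
# field by finite-difference gradient descent on the latent code z.
def latent_search(generator, target, z, tol=1e-8, lr=0.02, max_iters=5000):
    """Descend on (generator(z) - target)^2 until it falls below tol."""
    for _ in range(max_iters):
        err = (generator(z) - target) ** 2
        if err < tol:
            break
        h = 1e-6
        grad = ((generator(z + h) - target) ** 2 - err) / h
        z -= lr * grad
    return z

# toy 1-D 'generator' standing in for the GAN; the target plays the LES field
g = lambda z: 2.0 * z + 1.0
z_star = latent_search(g, 5.0, 0.0)
```

As the abstract notes, shrinking `tol` drives the reconstruction toward the ground truth; in the toy, tightening `tol` moves `g(z_star)` arbitrarily close to the target.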
{"title":"Data-Driven Whole-Genome Clustering to Detect Geospatial, Temporal, and Functional Trends in SARS-CoV-2 Evolution","authors":"Jean Merlet, John H. Lagergren, Verónica G. Melesse Vergara, Mikaela Cashman, C. Bradburne, R. Plowright, E. Gurley, Wayne Joubert, Daniel Jacobson","doi":"10.1145/3592979.3593425","DOIUrl":"https://doi.org/10.1145/3592979.3593425","url":null,"abstract":"Current methods for defining SARS-CoV-2 lineages ignore the vast majority of the SARS-CoV-2 genome. We develop and apply an exhaustive vector comparison method that directly compares all known SARS-CoV-2 genome sequences to produce novel lineage classifications. We utilize data-driven models that (i) accurately capture the complex interactions across the set of all known SARS-CoV-2 genomes, (ii) scale to leadership-class computing systems, and (iii) enable tracking how such strains evolve geospatially over time. We show that during the height of the original Omicron surge, countries across Europe, Asia, and the Americas had a spatially asynchronous distribution of Omicron sub-strains. Moreover, neighboring countries were often dominated by either different clusters of the same variant or different variants altogether throughout the pandemic. 
Analyses of this kind may suggest a different pattern of epidemiological risk than was understood from conventional data, as well as produce actionable insights and transform our ability to prepare for and respond to current and future biological threats.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126010103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
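The exhaustive vector-comparison idea above (compare all genome pairs directly, then group them into lineage clusters) can be sketched at toy scale. The similarity measure and greedy single-linkage grouping are illustrative assumptions, not the paper's leadership-scale method.

```python
# Toy sketch of all-pairs genome comparison and clustering (illustrative
# only): similarity is the fraction of agreeing positions between two
# aligned sequences, and clusters are grown greedily by single linkage.
def similarity(a, b):
    """Fraction of positions where two aligned genome strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(genomes, threshold=0.9):
    """Assign each genome to the first cluster it is similar enough to."""
    clusters = []
    for g in genomes:
        for c in clusters:
            if any(similarity(g, m) >= threshold for m in c):
                c.append(g)
                break
        else:
            clusters.append([g])   # no close match: start a new lineage
    return clusters

clusters = cluster(["AAAA", "AAAT", "TTTT", "TTTA"], threshold=0.7)
```

The real computation does this over the full SARS-CoV-2 genome for every known sequence pair, which is why it needs leadership-class systems.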
{"title":"Runtime Steering of Molecular Dynamics Simulations Through In Situ Analysis and Annotation of Collective Variables","authors":"Silvina Caíno-Lores, M. Cuendet, Jack D. Marquez, E. Kots, Trilce Estrada, E. Deelman, Harel Weinstein, M. Taufer","doi":"10.1145/3592979.3593420","DOIUrl":"https://doi.org/10.1145/3592979.3593420","url":null,"abstract":"This paper targets one of the most common simulations on petascale and, very likely, on exascale machines: molecular dynamics (MD) simulations studying the (classical) time evolution of a molecular system at atomic resolution. Specifically, this work addresses the data challenges of MD simulations at exascale through (1) the creation of a data analysis method based on a suite of advanced collective variables (CVs) selected for annotation of structural molecular properties and capturing rare conformational events at runtime, (2) the definition of an in situ framework to automatically identify the frames where the rare events occur during an MD simulation and (3) the integration of both method and framework into two MD workflows for the study of early termination or termination and restart of a benchmark molecular system for protein folding ---the Fs peptide system (Ace-A_5(AAARA)_3A-NME)--- using Summit. The approach achieves faster exploration of the conformational space compared to extensive ensemble simulations. Specifically, our in situ framework with early termination alone achieves 99.6% coverage of the reference conformational space for the Fs peptide with just 60% of the MD steps otherwise used for a traditional execution of the MD simulation. 
Annotation-based restart allows us to cover 94.6% of the conformational space, just running 50% of the overall MD steps.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133517702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
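The in situ annotation-and-early-termination loop described above can be sketched generically: advance the simulation, compute a collective variable per frame, and stop as soon as the CV signals the event of interest. The function names and the toy dynamics are assumptions for illustration, not the paper's framework.

```python
# Sketch of in situ CV annotation with early termination (illustrative,
# not the paper's framework): each frame is annotated with a collective
# variable, and the run stops once the CV reaches the target region.
def run_with_annotation(step_fn, cv_fn, n_steps, target, tol):
    """Advance an MD-like loop, annotating frames and terminating early."""
    frames, state = [], 0.0
    for i in range(n_steps):
        state = step_fn(state)
        cv = cv_fn(state)
        frames.append((i, cv))          # in situ annotation of this frame
        if abs(cv - target) < tol:      # rare event detected: stop early
            break
    return frames

# toy dynamics and CV standing in for the simulation and its annotations
frames = run_with_annotation(lambda s: s + 0.1, lambda s: s, 100, 0.5, 1e-6)
```

Stopping at frame 5 of a possible 100 mirrors the paper's result of covering the conformational space with only a fraction of the MD steps.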
{"title":"Exploiting symmetries for preconditioning Poisson's equation in CFD simulations","authors":"À. Alsalti-Baldellou, C. Janna, X. Álvarez-Farré, F. Trias","doi":"10.1145/3592979.3593410","DOIUrl":"https://doi.org/10.1145/3592979.3593410","url":null,"abstract":"Divergence constraints are present in the governing equations of many physical phenomena, and they usually lead to a Poisson equation whose solution is one of the most challenging parts of scientific simulation codes. Indeed, it is the main bottleneck of incompressible Computational Fluid Dynamics (CFD) simulations, and developing efficient and scalable Poisson solvers is a critical task. This work presents an enhanced variant of the Factored Sparse Approximate Inverse (FSAI) preconditioner. It arises from exploiting s spatial reflection symmetries, which are often present in academic and industrial configurations and allow transforming Poisson's equation into a set of 2^s fully-decoupled subsystems. Then, we introduce another level of approximation by taking advantage of the subsystems' close similarity and applying the same FSAI to all of them. This leads to substantial memory savings and notable increases in arithmetic intensity resulting from employing the more compute-intensive sparse matrix-matrix product. Of course, recycling the same preconditioner on all the subsystems worsens its convergence. However, this effect was much smaller than expected and led us to introduce relatively cheap but very effective low-rank corrections. A key feature of these corrections is that, because they are applied to each subsystem independently, the more symmetries being exploited, the more effective they become, leading to up to 5.7x faster convergence than the standard FSAI.
Numerical experiments on up to 1.07 billion grid points confirm the quality of our low-rank corrected FSAI, which, despite being 2.6x lighter, outperforms the standard FSAI by a factor of up to 4.4x.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"3 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120848712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
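The simplest instance of the reflection-symmetry decoupling above is a 2x2 mirror-symmetric system: changing to sum/difference variables splits it into two independent 1x1 subsystems, which is the s = 1 case of the 2^s decoupling. This is a hand-worked illustration, not the FSAI code.

```python
# Smallest instance of reflection-symmetry decoupling (illustrative only):
# the mirror-symmetric system [[a, b], [b, a]] x = f decouples in the
# sum/difference basis into two independent scalar subsystems.
def solve_mirror_symmetric(a, b, f1, f2):
    """Solve [[a, b], [b, a]] x = (f1, f2) via symmetric/antisymmetric parts."""
    s = (f1 + f2) / (a + b)    # symmetric subsystem
    d = (f1 - f2) / (a - b)    # antisymmetric subsystem
    return (s + d) / 2.0, (s - d) / 2.0

x1, x2 = solve_mirror_symmetric(3.0, 1.0, 5.0, 3.0)
```

With s symmetries the same change of basis, applied once per symmetry, yields 2^s smaller systems; the paper's contribution is reusing one FSAI across all of them plus per-subsystem low-rank corrections.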
{"title":"Universal Data Junction: A Transport Layer for Data Driven Workflows","authors":"U. Haus, Timothy Dykes, Aniello Esposito, Clément Foyer, Adrian Tate","doi":"10.1145/3592979.3593423","DOIUrl":"https://doi.org/10.1145/3592979.3593423","url":null,"abstract":"A novel transport library for the efficient coupling of applications through their data dependencies is presented. The design is driven by the intent to require minimal changes to existing scientific applications: applications declare the data objects that are meaningful for other applications to read and write, and the library performs transparent transport, including automatic redistribution of parallel data structures, thus permitting seamless coupling of applications in workflows. The actual transport can be selected at run time and can exploit a variety of data exchange methods, including MPI, Dataspaces, Ceph Rados, CRAY Datawarp, and a POSIX file system. For the case of MPI transport, the library is used to implement the first stage of a co-working visualization pipeline for CP2K, and results show a significant advantage compared to a filesystem-based approach.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126700157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
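The run-time transport selection described above amounts to a put/get interface with pluggable backends chosen by name. The sketch below is purely hypothetical (an in-memory backend and a `make_transport` factory), not the Universal Data Junction API; a real library would register MPI, POSIX, and object-store backends behind the same interface.

```python
# Hypothetical sketch of run-time backend selection (NOT the UDJ API):
# producers put named data objects, consumers get them, and the concrete
# transport is picked from a registry when the program starts.
class MemoryTransport:
    """Trivial in-process backend standing in for MPI/POSIX/object stores."""
    def __init__(self):
        self.store = {}

    def put(self, name, obj):
        self.store[name] = obj

    def get(self, name):
        return self.store[name]

BACKENDS = {"memory": MemoryTransport}   # real code: "mpi", "posix", ...

def make_transport(kind):
    """Select a transport implementation by name at run time."""
    return BACKENDS[kind]()

t = make_transport("memory")
t.put("density_field", [1.0, 2.0, 3.0])
```

Because applications only see `put`/`get` on named objects, swapping the backend requires no application changes, which is the minimal-intrusion goal the abstract states.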
{"title":"Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes","authors":"Jennifer Faj, Tobias Kenter, S. Faghih-Naini, Christian Plessl, V. Aizinger","doi":"10.1145/3592979.3593407","DOIUrl":"https://doi.org/10.1145/3592979.3593407","url":null,"abstract":"FPGAs are attracting interest as energy-efficient accelerators for scientific simulations, including for methods operating on unstructured meshes. Considering the potential impact on high-performance computing, specific attention needs to be given to the scalability of such approaches. In this context, the networking capabilities of FPGA hardware and software stacks can play a crucial role in enabling solutions that go beyond a traditional host-MPI and accelerator-offload model. In this work, we present the multi-FPGA scaling of a discontinuous Galerkin shallow-water model using direct low-latency streaming communication between the FPGAs. To this end, the unstructured mesh defining the spatial domain of the simulation is partitioned, the inter-FPGA network is configured to match the topology of neighboring partitions, and halo communication is overlapped with the dataflow computation pipeline. With this approach, we demonstrate strong scaling on up to eight FPGAs with a parallel efficiency of >80% and execution times per time step as low as 7.6 μs. At the same time, with weak scaling, the approach allows simulating larger meshes that would exceed the local memory limits of a single FPGA, now supporting meshes of more than 100,000 elements and reaching an aggregated performance of up to 6.5 TFLOP/s.
Finally, a hierarchical partitioning approach allows for better utilization of the FPGA compute resources in some designs and, by mitigating limitations posed by the communication topology, enables simulations with up to 32 partitions on 8 FPGAs.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124436741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Massively Parallel Multi-Scale FE2 Framework for Multi-Trillion Degrees of Freedom Simulations","authors":"C. Moulinec, G. Houzeaux, R. Borrell, Adria Quintanas Corominas, G. Oyarzun, Judicael Grasset, G. Giuntoli, M. Vázquez","doi":"10.1145/3592979.3593415","DOIUrl":"https://doi.org/10.1145/3592979.3593415","url":null,"abstract":"The advent of hybrid CPU and accelerator supercomputers opens the door to extremely large multi-scale simulations. One such multi-scale technique, the FE2 approach, is designed to simulate material deformations by obtaining a better estimation of the material properties, which, in effect, reduces the need to introduce physical modelling at the macro-scale level, such as constitutive laws. Both macro- and micro-scales are solved using the Finite Element method, the micro-scale being resolved at the Gauss points of the macro-scale mesh. As the micro-scale simulations do not require any information from each other and are thus run concurrently, the stated problem is embarrassingly parallel. The FE2 method therefore directly benefits from hybrid machines, the macro-scale being solved on CPUs whereas the micro-scale is offloaded to accelerators. The case of a flat plate made of different materials is used to illustrate the potential of the method. To ensure good load balance on distributed-memory machines, weighting based on the types of materials the plate is made of is applied by means of a Space-Filling Curve technique.
Simulations have been carried out for over 5 trillion degrees of freedom on up to 2,048 nodes (49,152 CPUs and 12,288 GPUs) of the US DOE Oak Ridge National Laboratory high-end machine, Summit, showing an excellent speed-up for the assembly part of the framework, where the micro-scale is computed on GPU using CUDA.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"52 12","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120970003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
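The weighted load balancing described above (elements ordered along a space-filling curve, then split into contiguous chunks of roughly equal total weight) can be sketched in a few lines. The greedy splitter below is an illustrative assumption, not the framework's actual partitioner.

```python
# Toy sketch of weighted partitioning along a space-filling-curve ordering
# (illustrative only): elements are assumed already SFC-ordered, weights
# reflect per-element cost (e.g. material type), and the list is cut into
# contiguous chunks of roughly equal total weight.
def partition_by_weight(weights, n_parts):
    """Greedily split SFC-ordered element weights into balanced chunks."""
    total = sum(weights)
    target = total / n_parts
    parts, current, acc = [], [], 0.0
    for i, w in enumerate(weights):
        current.append(i)
        acc += w
        if acc >= target and len(parts) < n_parts - 1:
            parts.append(current)      # chunk reached its weight budget
            current, acc = [], 0.0
    parts.append(current)
    return parts

# four cheap elements and two expensive ones split evenly across two ranks
parts = partition_by_weight([1, 1, 1, 1, 2, 2], 2)
```

Because SFC ordering keeps nearby elements contiguous, each chunk is also spatially compact, which keeps halo communication small.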