The International Conference on High Performance Computing in Asia-Pacific Region Companion最新文献

Advantages of Space-Time Finite Elements for Domains with Time Varying Topology 时变拓扑域的空时有限元优势

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440907

N. Hosters, Maximilian von Danwitz, Patrick Antony, M. Behr

引用次数: 0

Single-Precision Calculation of Iterative Refinement of Eigenpairs of a Real Symmetric-Definite Generalized Eigenproblem by Using a Filter Composed of a Single Resolvent 用单解组成的滤波器迭代求实对称定广义特征问题特征对的单精度计算

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440784

H. Murakami

{"title":"Single-Precision Calculation of Iterative Refinement of Eigenpairs of a Real Symmetric-Definite Generalized Eigenproblem by Using a Filter Composed of a Single Resolvent","authors":"H. Murakami","doi":"10.1145/3440722.3440784","DOIUrl":"https://doi.org/10.1145/3440722.3440784","url":null,"abstract":"By using a filter, we calculate approximate eigenpairs of a real symmetric-definite generalized eigenproblem Av = λBv whose eigenvalues are in a specified interval. In our experiments in this paper, the IEEE-754 single-precision floating-point (binary 32bit) number system is used for calculations. In general, a filter is constructed by using some resolvents with different shifts ρ. For a given vector x, an action of a resolvent is given by solving a system of linear equations C(ρ)y = Bx for y, here the coefficient C(ρ) = A − ρB is symmetric. We assume to solve this system of linear equations by matrix factorization of C(ρ), for example by the modified Cholesky method (LDLT decomposition method). When both matrices A and B are banded, C(ρ) is also banded and the modified Cholesky method for banded system can be used to solve the system of linear equations. The filter we used is either a polynomial of a resolvent with a real shift, or a polynomial of an imaginary part of a resolvent with an imaginary shift. We use only a single resolvent to construct the filter in order to reduce both amounts of calculation to factor matrices and especially storage to hold factors of matrices. The most disadvantage when we use only a single resolvent rather than many is, such a filter have poor properties especially when compuation is made in single-precision. Therefore, approximate eigenpairs required are not obtained in good accuracy if they are extracted from the set of vectors made by an application of a combination of B-orthonormalization and filtering to a set of initial random vectors. However, experiments show approximate eigenpairs required are refined well if they are extracted from the set of vectors obtained by a few applications of a combination of B-orthonormalization and filtering to a set of initial random vectors.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131374910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Molecular-Continuum Flow Simulation in the Exascale and Big Data Era 百亿亿次和大数据时代的分子连续流模拟

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440903

Philipp Neumann, Vahid Jafari, P. Jarmatz, F. Maurer, Helene Wittenberg, Niklas Wittmer

引用次数: 0

Multi-scale Modelling of Urban Air Pollution with Coupled Weather Forecast and Traffic Simulation on HPC Architecture 基于HPC架构的城市大气污染天气预报与交通模拟耦合多尺度模型

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440917

L. Kornyei, Z. Horváth, A. Ruopp, Á. Kovács, Bence Liszkai

{"title":"Multi-scale Modelling of Urban Air Pollution with Coupled Weather Forecast and Traffic Simulation on HPC Architecture","authors":"L. Kornyei, Z. Horváth, A. Ruopp, Á. Kovács, Bence Liszkai","doi":"10.1145/3440722.3440917","DOIUrl":"https://doi.org/10.1145/3440722.3440917","url":null,"abstract":"Urban air pollution is one of the global challenges to which over 3 million deaths are attributable yearly. Traffic is emitting over 40% of several contaminants, like NO2 [10]. The directive 2008/50/EC of the European Commission prescribes the assessment air quality by accumulating exceedance of contamination concentration limits over a one-year period using measurement stations, which may be supplemented by modeling techniques to provide adequate information on spatial distribution. Computational models do predict that small scale spatial fluctuation is expected on the street level: local air flow phenomena can cluster up pollutants or carry them away far from the location of emission [2]. The spread of the SARS-CoV-2 virus also interacts with urban air quality. Regions in lock down have highly reduced air pollution strain due to the drop of traffic [4]. Also, correlation between the fatality rate of a previous respiratory disease, SARS 2002, and Air Pollution Index suggests that bad air quality may double fatality rate [6]. At street level pollution dispersion highly depends on the daily weather, a one-year simulation low time scale model is needed. Additionally, to resolve street-level phenomena a cell size of 1 to 4 meters are utilized in these regions that requires CFD methods to use a simulation domain of 1 to 100 million cells. Memory and computational requirements for these tasks are enormous, so HPC architecture is needed to have reasonable results within a manageable time frame. To tackle this challenge, the Urban Air Pollution (UAP) workflow is developed as a pilot of the HiDALGO project [7], which is funded by the H2020 framework of the European Union. The pilot is designed in a modular way with the mindset to be developed into a digital twin model later. Its standardized interfaces enable multiple software to be used in a specific module. At its core, a traffic simulation implemented in SUMO is coupled with a CFD simulation. Currently OpenFOAM (v1906, v1912 and v2006) and Ansys Fluent (v19.2) are supported. This presentation focuses on the OpenFOAM implementation, as it proved more feasible and scalable on most HPC architectures. The incompressible unsteady Reynolds-averaged Navier– Stokes equations are solved with the PIMPLE method, Courant-number based adaptive time stepping and transient atmospheric boundary conditions. The single component NOx-type pollution is calculated independently as a scalar with transport equations along the flow field. Pollution emission is treated as a per cell volumetric source that changes in time. The initial condition is obtained from a steady state solution at the initial time with the SIMPLE method, using the identical, but stationary boundary conditions and source fields. Custom modules are developed for proper boundary condition and source term handling. The UAP workflow supports automatic 3D air flow geometry and traffic network generation from OpenStreetMap data. Ground and building information","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114143442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow 基于TensorFlow的Intel至强架构分布式MLPerf ResNet50训练

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440880

Wei Wang, N. Hasabnis

引用次数: 0

Node-level Performance Optimizations in CFD Codes CFD代码中的节点级性能优化

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440914

Peter Wauligmann, Jakob Dürrwächter, Philipp Offenhäuser, A. Schlottke, M. Bernreuther, B. Dick

引用次数: 3

High Performance Simulations of Quantum Transport using Manycore Computing 基于多核计算的量子传输的高性能模拟

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440879

Yosang Jeong, H. Ryu

{"title":"High Performance Simulations of Quantum Transport using Manycore Computing","authors":"Yosang Jeong, H. Ryu","doi":"10.1145/3440722.3440879","DOIUrl":"https://doi.org/10.1145/3440722.3440879","url":null,"abstract":"The Non-Equilibrium Green’s Function (NEGF) has been widely utilized in the field of nanoscience and nanotechnology to predict carrier transport behaviors in electronic device channels of sizes in a quantum regime. This work explores how much performance improvement can be driven for NEGF computations with unique features of manycore computing, where the core numerical step of NEGF computations involves a recursive process of matrix-matrix multiplication. The major techniques adopted for the performance enhancement are data-restructuring, matrix-tiling, thread-scheduling, and offload computing and we present in-depth discussion on why they are critical to fully exploit the power of manycore computing hardware including Intel Xeon Phi Knights Landing systems and NVIDIA general-purpose graphic processing unit (GPU) devices. Performance of the optimized algorithm has been tested in a single computing node, where the host is Xeon Phi 7210 that is equipped with two NVIDIA Quadro GV100 GPU devices. The target structure of NEGF simulations is a [100] silicon nanowire that consists of 100K atoms involving a 1000K × 1000K complex Hamiltonian matrix. Through rigorous benchmark tests, we show, with optimization techniques whose details are elaborately explained, the workload can be accelerated almost by a factor of up to ∼ 20 compared to the unoptimized case.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132977226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Comparison of Parallel Profiling Tools for Programs utilizing the FFT 利用FFT的程序并行分析工具的比较

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440881

B. Leu, S. Aseeri, B. Muite

引用次数: 0

Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters Intel Xeon Phi集群的高效并行多网格方法

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440882

K. Nakajima, Balazs Gerofi, Y. Ishikawa, Masashi Horikoshi

{"title":"Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters","authors":"K. Nakajima, Balazs Gerofi, Y. Ishikawa, Masashi Horikoshi","doi":"10.1145/3440722.3440882","DOIUrl":"https://doi.org/10.1145/3440722.3440882","url":null,"abstract":"The parallel multigrid method is expected to play an important role in scientific computing on exa-scale supercomputer systems for solving large-scale linear equations with sparse matrices. Because solving sparse linear systems is a very memory-bound process, efficient method for storage of coefficient matrices is a crucial issue. In the previous works, authors implemented sliced ELL method to parallel conjugate gradient solvers with multigrid preconditioning (MGCG) for the application on 3D groundwater flow through heterogeneous porous media (pGW3D-FVM), and excellent performance has been obtained on large-scale multicore/manycore clusters. In the present work, authors introduced SELL-C-σ to the MGCG solver, and evaluated the performance of the solver with various types of OpenMP/MPI hybrid parallel programing models on the Oakforest-PACS (OFP) system at JCAHPC using up to 1,024 nodes of Intel Xeon Phi. Because SELL-C-σ is suitable for wide-SIMD architecture, such as Xeon Phi, improvement of the performance over the sliced ELL was more than 20%. This is one of the first examples of SELL-C-σ applied to forward/backward substitutions in ILU-type smoother of multigrid solver. Furthermore, effects of IHK/McKernel has been investigated, and it achieved 11% improvement on 1,024 nodes.","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122273865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

An efficient halo approach for Euler-Lagrange simulations based on MPI-3 shared memory 一种基于MPI-3共享内存的欧拉-拉格朗日模拟晕轮方法

The International Conference on High Performance Computing in Asia-Pacific Region Companion Pub Date : 2021-01-20 DOI: 10.1145/3440722.3440904

Patrick Kopper, M. Pfeiffer, S. Copplestone, A. Beck

{"title":"An efficient halo approach for Euler-Lagrange simulations based on MPI-3 shared memory","authors":"Patrick Kopper, M. Pfeiffer, S. Copplestone, A. Beck","doi":"10.1145/3440722.3440904","DOIUrl":"https://doi.org/10.1145/3440722.3440904","url":null,"abstract":"Euler-Lagrange methods are a common approach for simulation of dispersed particle-laden flow, e.g. in turbomachinery. In this approach, the fluid is treated as continuous phase with an Eulerian field solver whereas the Lagrangian movement of the dispersed phase is described through the equations of motion for each individual particle. In high-performance computing, the load of the fluid phase is only dependent on the degrees of freedom and load-balancing steps can be taken a priori, thereby ensuring optimal scaling. However, the discrete phase introduces local load imbalances that cannot easily predicted as generally neither the spatial particle distribution nor the computational cost for advancing particles in relation to the fluid integration are know a priori. Runtime load balancing alleviates this problem by adjusting the local load on each processor according to information gathered during the simulation [4]. Since the load balancing step becomes part of the simulation time, its performance and appropriate scaling on modern HPC systems becomes of crucial importance. In this talk, we will first present the FLEXI framework for the Euler-Lagrange system, and follow by introducing the previous approach and highlight its difficulties. FLEXI is a high-order accurate, massively parallel CFD framework based on the Discontinuous Galerkin Spectral Element Method (DGSEM). It has shown excellent scaling properties for the fluid phase and was recently extended by particle tracking capabilities [1], developed together with the PICLas framework [2]. In FLEXI, the mesh is saved in the HDF5 format, allowing for parallel access, with the elements presorted along a space-filling curve (SFC). This approach has shown its suitability for fluid simulations as each processor requires and accesses only the local mesh information, thereby reducing I/O on the underlying file system [3]. However, the particle phase needs additional information around the fluid domain to retain high computational efficiency since particles can cross the local domain boundary at any point during a time step. In previous implementations, this “halo region” information was communicated between each individual processor, causing significant CPU and network load for an extended period of time during initialization and each load balancing step. Therefore, we propose an method developed from scratch utilizing modern MPI calls and able to overcome most of the challenges in the previous approach. This reworked method utilizes MPI-3 shared memory to make mesh information available to all processors on a compute-node. We perform a two-step, communication-free identification of all relevant mesh elements for a compute-node. Furthermore, by making the mesh information accessible to all processors sharing local memory, we eliminate redundant calculations and reduce data duplication. We conclude by presenting examples of large scale computations of particle-laden flows in complex turbomachinery system","PeriodicalId":183674,"journal":{"name":"The International Conference on High Performance Computing in Asia-Pacific Region Companion","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116826805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1