{"title":"Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1145/3578178.3578238","DOIUrl":"https://doi.org/10.1145/3578178.3578238","url":null,"abstract":"Matrix-matrix multiplication is used in various linear algebra algorithms, such as matrix decomposition and tensor contraction. NVIDIA Tensor Cores are mixed-precision matrix-matrix multiply-and-add computing units whose theoretical peak performance exceeds 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, the Bytes-per-Flop (B/F) ratio between shared memory and Tensor Cores is small since the performance of Tensor Cores is high. Thus, reducing the shared memory footprint is important for efficient Tensor Core usage. In this paper, we analyze simple matrix-matrix multiplication on Tensor Cores with the roofline model and find that the bandwidth of shared memory can limit performance when using the WMMA API. To alleviate this issue, we provide a WMMA API extension library with two components that boosts the throughput of the computation. The first component allows flexible manipulation of the arrays of registers that are input to Tensor Cores. We evaluate this library and show that it reduces the shared memory footprint and speeds up computation on Tensor Cores. The second component is an API for SGEMM emulation on Tensor Cores without additional shared memory usage. We demonstrate that a single-precision-emulating batched SGEMM implementation on Tensor Cores using this library achieves 54.2 TFlop/s on the A100 GPU, which exceeds the theoretical peak performance of the FP32 SIMT cores while achieving the same level of accuracy as cuBLAS. This throughput cannot be achieved without the shared memory footprint reduction performed by our library, given the same amount of register usage.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122530276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case Study on DaCe Portability & Performance for Batched Discrete Fourier Transforms","authors":"Måns I. Andersson, S. Markidis","doi":"10.1145/3578178.3578239","DOIUrl":"https://doi.org/10.1145/3578178.3578239","url":null,"abstract":"With the emergence of new computer architectures, portability and performance portability have become significant concerns for developing HPC applications. This work reports our experience and lessons learned using DaCe to create and optimize batched Discrete Fourier Transform (DFT) calculations on different single-node computer systems. The batched DFT calculation is an essential component of FFT algorithms and is widely used in computer science, numerical analysis, and signal processing. We implement the batched DFT with three complex-value array data layouts and compare them with the native complex-type implementation. We use DaCe, which relies on Stateful DataFlow multiGraphs (SDFG) as an intermediate representation (IR) that can be optimized through transformations before generating code for different architectures. We present several performance results showcasing the potential of DaCe for expressing HPC applications on different computer systems.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115482359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESSPER: Elastic and Scalable FPGA-Cluster System for High-Performance Reconfigurable Computing with Supercomputer Fugaku","authors":"K. Sano, Atsushi Koshiba, Takaaki Miyajima, Tomohiro Ueno","doi":"10.1145/3578178.3579341","DOIUrl":"https://doi.org/10.1145/3578178.3579341","url":null,"abstract":"FPGA clusters have yet to become mainstream in HPC, even as accelerators, and several challenges remain in their architecture and system organization. This work presents ESSPER, a flexible and scalable FPGA-cluster prototype system for reconfigurable HPC designed to meet the goals of customizability, scalability, and interoperability with existing HPC systems. Based on our classification of FPGA-cluster architectures, we propose a new category of FPGA clusters with a host-FPGA bridging network that uses software-bridged APIs to access remote FPGAs. We have designed, implemented, verified, and demonstrated a proof-of-concept ESSPER system as a functional extension of the supercomputer Fugaku.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123707707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelization of Automatic Tuning for Hyperparameter Optimization of Pedestrian Route Prediction Applications using Machine Learning","authors":"Sorataro Fujika, Yuga Yajima, Teruo Tanaka, A. Fujii, Yuka Kato, S. Ohshima, T. Katagiri","doi":"10.1145/3578178.3578235","DOIUrl":"https://doi.org/10.1145/3578178.3578235","url":null,"abstract":"We study automatic software tuning. Automatic tuning tools using iterative one-dimensional search estimate the hyperparameters of machine learning programs. Iterative one-dimensional search explores the parameter space of candidate values for the tuned parameters by repeatedly measuring and evaluating the target program. Since training a machine learning program takes time, estimating the optimal hyperparameters is time-consuming. Therefore, we propose a method that reduces the time required for automatic tuning by parallelizing the iterative one-dimensional search. For parallelization, we use multiple-job execution on a supercomputer that provides multiple GPUs, which is effective for machine learning. In this method, each job measures different hyperparameters, and the next search point is determined by referring to the data obtained from each job. The target program is a pedestrian route prediction application, which predicts future routes and arrival points from past pedestrian trajectory data. The program is intended to be used in a variety of locations, and the locations and movement patterns vary depending on the dataset used for training. We hypothesized that the estimation results from one dataset could be used for the automatic tuning of another dataset, thereby reducing the time required for automatic tuning. Experimental results confirm that the parallelized iterative one-dimensional search reduces the estimation time from 89.5 hours to 4 hours compared to the sequential search. We also show that the iterative one-dimensional search efficiently investigates the points at which the performance index improves. Moreover, when the hyperparameters estimated for one dataset are used as the initial search point for the automatic tuning of another dataset, both the number of executions and the execution time are reduced compared to automatic tuning starting from the currently used hyperparameters.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127611260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of Reproducible Parallel Preconditioned BiCGSTAB Algorithm Based on ExBLAS and ReproBLAS","authors":"X. Lei, Tongxiang Gu, S. Graillat, Xiaowen Xu, Jing Meng","doi":"10.1145/3578178.3578234","DOIUrl":"https://doi.org/10.1145/3578178.3578234","url":null,"abstract":"Krylov subspace algorithms are important methods for solving linear systems. In order to solve large-scale linear systems efficiently, parallelization techniques are often applied. However, parallelism often enlarges the non-associativity of floating-point operations, which can lead to non-reproducibility of the computations. This paper compares the performance of the parallel preconditioned BiCGSTAB algorithm implemented with two different libraries (ExBLAS and ReproBLAS) that ensure the reproducibility of computations. To control the effect of the compiler, we explicitly use FMA instructions. Numerical experiments show that the BiCGSTAB algorithms based on both BLAS implementations are reproducible; the algorithm based on ExBLAS is more accurate but more time-consuming than the one based on ReproBLAS.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124571618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Large Integer Multiplication with Arm SVE Instructions","authors":"Takuya Edamatsu, D. Takahashi","doi":"10.1145/3578178.3578193","DOIUrl":"https://doi.org/10.1145/3578178.3578193","url":null,"abstract":"In this study, we implement large integer multiplication with the Arm Scalable Vector Extension (SVE) instructions. SVE is a single-instruction, multiple-data (SIMD) instruction set for the Arm AArch64 architecture. We use a reduced-radix representation because SIMD instructions do not retain the carry that occurs when partial products are added in large integer multiplication. Furthermore, we develop and implement a multiplication algorithm based on the Basecase method, which allows ordinary multiplication instructions to be applied to special integers in reduced-radix representation. To evaluate performance, we compare our multiplication implementation on an A64FX processor with the GNU Multiple Precision Arithmetic Library (GMP). We show that processing with SVE is faster than GMP for multiplication with operands larger than 2,048 bits, with a performance gain of up to 36%. These results suggest that SVE instructions have the potential to be faster than scalar instructions for large integer multiplication, especially for large operands.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125734974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU–FPGA-accelerated Radiative Transfer Simulation with Inter-FPGA Communication","authors":"Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, Makito Abe, M. Umemura","doi":"10.1145/3578178.3578231","DOIUrl":"https://doi.org/10.1145/3578178.3578231","url":null,"abstract":"The complementary use of graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) is a major topic of interest in the high-performance computing (HPC) field. GPU–FPGA-accelerated computing is an effective tool for multiphysics simulations, which encompass multiple physical models and simultaneous physical phenomena. Because the constituent operations in multiphysics simulations exhibit varying characteristics, accelerating them solely with GPUs is often challenging; hence, FPGAs are frequently employed for this purpose. The objective of the present study is to further improve application performance by employing GPUs and FPGAs in a complementary manner. Recently, this approach has been applied to ARGOT, a radiative transfer simulation code for astrophysics, with evaluation results quantitatively demonstrating the resulting performance improvement. However, those evaluation results came from a single node equipped with both a GPU and an FPGA. In this study, we extended the GPU–FPGA-accelerated ARGOT code to operate on multiple nodes using the Message Passing Interface (MPI) and an FPGA-to-FPGA communication scheme called the Communication Integrated Reconfigurable CompUting System (CIRCUS). We evaluated the performance of the ARGOT code with multiple GPUs and FPGAs under weak-scaling conditions and found that it achieves up to 12.8x speedup compared to GPU-only execution.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134416637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effectiveness of the Oversubscribing Scheduling on Supercomputer Systems","authors":"Shohei Minami, Toshio Endo, Akihiro Nomura","doi":"10.1145/3578178.3578221","DOIUrl":"https://doi.org/10.1145/3578178.3578221","url":null,"abstract":"High responsiveness is essential for users' satisfaction with supercomputer systems. Recently, the use of interactive jobs in addition to traditional batch jobs has been attracting attention, and handling these jobs in a consolidated manner is becoming important for responsive systems. Here we show that oversubscribing scheduling, in which multiple HPC jobs share computational resources, can process jobs effectively. This paper builds a job scheduling simulator that supports oversubscription and evaluates the oversubscribing system using workload trace data from an actual supercomputer. While keeping users' response times short, our solution achieves strengths not found in a conventional solution: benefits for normal jobs, alleviation of slowdown, and no need for careful system configuration.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126164759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-Tolerant LOBPCG for Nuclear CI Calculations","authors":"Meiyue Shao, Dossay Oryspayev, Chao Yang, Pieter Maris, B. Cook","doi":"10.1145/3578178.3578240","DOIUrl":"https://doi.org/10.1145/3578178.3578240","url":null,"abstract":"Exascale computing platforms with millions of compute units across thousands of nodes are predicted to experience frequent faults that interrupt applications' execution. In this context, resilience against faults becomes important. We examine user- and software-level fault mitigation strategies in a distributed LOBPCG algorithm targeting nuclear CI calculations. In particular, we present and evaluate a strategy that keeps the total number of fault-tolerant LOBPCG iterations close to that of the standard LOBPCG algorithm run on a fault-free machine.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133921516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new data conversion method for mixed precision Krylov solvers with FP16/BF16 Jacobi preconditioners","authors":"Takuya Ina, Y. Idomura, Toshiyuki Imamura, Naoyuki Onodera","doi":"10.1145/3578178.3578222","DOIUrl":"https://doi.org/10.1145/3578178.3578222","url":null,"abstract":"Mixed precision Krylov solvers with the Jacobi preconditioner often show significant convergence degradation when the Jacobi preconditioner is computed in low precision such as FP16 or BF16. We find that this convergence degradation is attributable to a loss of diagonal dominance caused by roundoff errors in data conversion. To resolve this issue, we propose a new data conversion method designed to preserve the diagonal dominance of the original matrix data. The proposed method is tested by solving the Poisson equation using the conjugate gradient method, the generalized minimal residual method, and the biconjugate gradient stabilized method with the FP16/BF16 Jacobi preconditioner on NVIDIA V100 GPUs. The new data conversion is implemented by switching among the round-to-nearest, round-up, round-down, and round-towards-zero intrinsics in CUDA, and it is called once before the main iteration; its cost is therefore negligible. When the coefficients of the matrix are continuously changed by scaling the linear system, the conventional data conversion based on the round-to-nearest intrinsic shows periodic changes in the convergence property depending on the difference in roundoff errors between diagonal and off-diagonal coefficients, where the period and magnitude of the convergence degradation depend on the bit length of the significand. In contrast, the proposed data conversion method fully avoids the convergence degradation and enables robust mixed precision computing for the Jacobi preconditioner without extra overhead.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"84 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127976838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}