{"title":"Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1145/3578178.3578238","DOIUrl":"https://doi.org/10.1145/3578178.3578238","url":null,"abstract":"Matrix-matrix multiplication is used in various linear algebra algorithms, such as matrix decomposition and tensor contraction. NVIDIA Tensor Cores are mixed-precision matrix-matrix multiply-and-add computing units whose theoretical peak performance exceeds 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, the Bytes-per-Flop (B/F) ratio between shared memory and Tensor Cores is small since the performance of Tensor Cores is high. Thus, reducing the shared memory footprint is important for efficient Tensor Core usage. In this paper, we analyze simple matrix-matrix multiplication on Tensor Cores with the roofline model and find that the bandwidth of shared memory can limit performance when using the WMMA API. To alleviate this issue, we provide a WMMA API extension library with two components that boosts the throughput of the computation. The first component allows flexible manipulation of the arrays of registers that are input to Tensor Cores. We evaluate this library and show that it reduces the shared memory footprint and speeds up computation on Tensor Cores. The second component is an API for SGEMM emulation on Tensor Cores without additional shared memory usage. We demonstrate that a single-precision-emulating batched SGEMM implementation on Tensor Cores using this library achieves 54.2 TFlop/s on the A100 GPU, which exceeds the theoretical peak performance of the FP32 SIMT cores while achieving the same level of accuracy as cuBLAS. This throughput cannot be achieved without the shared memory footprint reduction performed by our library, given the same amount of register usage.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122530276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case Study on DaCe Portability & Performance for Batched Discrete Fourier Transforms","authors":"Måns I. Andersson, S. Markidis","doi":"10.1145/3578178.3578239","DOIUrl":"https://doi.org/10.1145/3578178.3578239","url":null,"abstract":"With the emergence of new computer architectures, portability and performance portability have become significant concerns for developing HPC applications. This work reports our experience and lessons learned using DaCe to create and optimize batched Discrete Fourier Transform (DFT) calculations on different single-node computer systems. The batched DFT calculation is an essential component of FFT algorithms and is widely used in computer science, numerical analysis, and signal processing. We implement the batched DFT with three complex-value array data layouts and compare them with the native complex-type implementation. We use DaCe, which relies on Stateful DataFlow multiGraphs (SDFG) as an intermediate representation (IR) that can be optimized through transformations before generating code for different architectures. We present several performance results showcasing the potential of DaCe for expressing HPC applications on different computer systems.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115482359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESSPER: Elastic and Scalable FPGA-Cluster System for High-Performance Reconfigurable Computing with Supercomputer Fugaku","authors":"K. Sano, Atsushi Koshiba, Takaaki Miyajima, Tomohiro Ueno","doi":"10.1145/3578178.3579341","DOIUrl":"https://doi.org/10.1145/3578178.3579341","url":null,"abstract":"FPGA clusters have yet to become mainstream in HPC, even as accelerators, and several challenges remain in their architecture and system organization. This work presents ESSPER, a flexible and scalable FPGA-cluster prototype system for reconfigurable HPC designed to meet the goals of customizability, scalability, and interoperability with existing HPC systems. Based on our classification of FPGA-cluster architectures, we propose a new category of FPGA clusters with a host-FPGA bridging network that uses software-bridged APIs to access remote FPGAs. We have designed, implemented, verified, and demonstrated a proof-of-concept ESSPER system as a functional extension of the supercomputer Fugaku.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123707707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelization of Automatic Tuning for Hyperparameter Optimization of Pedestrian Route Prediction Applications using Machine Learning","authors":"Sorataro Fujika, Yuga Yajima, Teruo Tanaka, A. Fujii, Yuka Kato, S. Ohshima, T. Katagiri","doi":"10.1145/3578178.3578235","DOIUrl":"https://doi.org/10.1145/3578178.3578235","url":null,"abstract":"We study automatic software tuning. Automatic tuning tools using iterative one-dimensional search estimate the hyperparameters of machine learning programs. Iterative one-dimensional search explores the parameter space of candidate values for the tuned parameters by repeatedly measuring and evaluating the target program. Since training a machine learning program takes time, estimating the optimal hyperparameters is time-consuming. Therefore, we propose a method that reduces the time required for automatic tuning by parallelizing the iterative one-dimensional search. For parallelization, we use multiple-job execution on a supercomputer that provides multiple GPUs, which is effective for machine learning. In this method, each job measures different hyperparameters, and the next search point is determined by referring to the data obtained from each job. The target program is a pedestrian route prediction application, which predicts future routes and arrival points from past pedestrian trajectory data. The program is intended to be used in a variety of locations, and the locations and movement patterns vary depending on the dataset used for training. We hypothesized that the estimation results from one dataset could be used for the automatic tuning of another dataset, thereby reducing the time required for automatic tuning. Experimental results confirm that the parallelized iterative one-dimensional search reduces the estimation time from 89.5 hours to 4 hours compared to the sequential search. We also show that the iterative one-dimensional search efficiently investigates the points at which the performance index improves. Moreover, when the hyperparameters estimated for one dataset are used as the initial search point for the automatic tuning of another dataset, both the number of executions and the execution time are reduced compared to automatic tuning starting from the currently used hyperparameters.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127611260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of Reproducible Parallel Preconditioned BiCGSTAB Algorithm Based on ExBLAS and ReproBLAS","authors":"X. Lei, Tongxiang Gu, S. Graillat, Xiaowen Xu, Jing Meng","doi":"10.1145/3578178.3578234","DOIUrl":"https://doi.org/10.1145/3578178.3578234","url":null,"abstract":"Krylov subspace algorithms are important methods for solving linear systems. In order to solve large-scale linear systems efficiently, parallelization techniques are often applied. However, parallelism often enlarges the non-associativity of floating-point operations, which can lead to non-reproducibility of the computations. This paper compares the performance of the parallel preconditioned BiCGSTAB algorithm implemented with two different libraries (ExBLAS and ReproBLAS) that ensure the reproducibility of computations. To control the effect of the compiler, we explicitly use FMA instructions. Numerical experiments show that the BiCGSTAB algorithms based on both BLAS implementations are reproducible; the algorithm based on ExBLAS is more accurate but more time-consuming than the one based on ReproBLAS.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124571618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Large Integer Multiplication with Arm SVE Instructions","authors":"Takuya Edamatsu, D. Takahashi","doi":"10.1145/3578178.3578193","DOIUrl":"https://doi.org/10.1145/3578178.3578193","url":null,"abstract":"In this study, we implement large integer multiplication with the Arm Scalable Vector Extension (SVE) instructions. SVE is a single-instruction, multiple-data (SIMD) instruction set for the Arm AArch64 architecture. We use a reduced-radix representation because SIMD instructions do not retain the carry that occurs when partial products are added in large integer multiplication. Furthermore, we develop and implement a multiplication algorithm based on the Basecase method, which allows ordinary multiplication instructions to be applied to special integers in reduced-radix representation. To evaluate performance, we compare our multiplication implementation on an A64FX processor with the GNU Multiple Precision Arithmetic Library (GMP). We show that processing with SVE is faster than GMP for multiplication with operands larger than 2,048 bits, with a performance gain of up to 36%. These results suggest that SVE instructions have the potential to be faster than scalar instructions for large integer multiplication, especially for large operands.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125734974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU–FPGA-accelerated Radiative Transfer Simulation with Inter-FPGA Communication","authors":"Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, Makito Abe, M. Umemura","doi":"10.1145/3578178.3578231","DOIUrl":"https://doi.org/10.1145/3578178.3578231","url":null,"abstract":"The complementary use of graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) is a major topic of interest in the high-performance computing (HPC) field. GPU–FPGA-accelerated computing is an effective tool for multiphysics simulations, which encompass multiple physical models and simultaneous physical phenomena. Because the constituent operations in multiphysics simulations exhibit varying characteristics, accelerating them solely with GPUs is often challenging; hence, FPGAs are frequently employed for this purpose. The objective of the present study is to further improve application performance by employing GPUs and FPGAs in a complementary manner. Recently, this approach has been applied to ARGOT, a radiative transfer simulation code for astrophysics, with evaluation results quantitatively demonstrating the resulting performance improvement. However, those evaluation results came from a single node equipped with both a GPU and an FPGA. In this study, we extended the GPU–FPGA-accelerated ARGOT code to operate on multiple nodes using the Message Passing Interface (MPI) and an FPGA-to-FPGA communication scheme called the Communication Integrated Reconfigurable CompUting System (CIRCUS). We evaluated the performance of the ARGOT code with multiple GPUs and FPGAs under weak-scaling conditions and found that it achieves up to 12.8x speedup compared to GPU-only execution.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134416637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effectiveness of the Oversubscribing Scheduling on Supercomputer Systems","authors":"Shohei Minami, Toshio Endo, Akihiro Nomura","doi":"10.1145/3578178.3578221","DOIUrl":"https://doi.org/10.1145/3578178.3578221","url":null,"abstract":"High responsiveness is essential for users' satisfaction with supercomputer systems. Recently, the use of interactive jobs in addition to traditional batch jobs has been attracting attention, and handling these jobs in a consolidated manner is becoming important for responsive systems. Here we show that oversubscribing scheduling, in which multiple HPC jobs share computational resources, can process jobs effectively. This paper builds a job scheduling simulator that supports oversubscription and evaluates the oversubscribing system using workload trace data from an actual supercomputer. While keeping users' response times short, our solution achieves strengths not found in a conventional solution: benefits for normal jobs, alleviation of slowdown, and no need for careful system configuration.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126164759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-Tolerant LOBPCG for Nuclear CI Calculations","authors":"Meiyue Shao, Dossay Oryspayev, Chao Yang, Pieter Maris, B. Cook","doi":"10.1145/3578178.3578240","DOIUrl":"https://doi.org/10.1145/3578178.3578240","url":null,"abstract":"Exascale computing platforms with millions of compute units across thousands of nodes are predicted to experience frequent faults that interrupt applications' execution. In this context, resilience against faults becomes important. We examine user- and software-level fault mitigation strategies in a distributed LOBPCG algorithm targeting nuclear CI calculations. In particular, we present and evaluate a strategy that keeps the total number of fault-tolerant LOBPCG iterations close to that of the standard LOBPCG algorithm run on a fault-free machine.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133921516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new data conversion method for mixed precision Krylov solvers with FP16/BF16 Jacobi preconditioners","authors":"Takuya Ina, Y. Idomura, Toshiyuki Imamura, Naoyuki Onodera","doi":"10.1145/3578178.3578222","DOIUrl":"https://doi.org/10.1145/3578178.3578222","url":null,"abstract":"Mixed precision Krylov solvers with the Jacobi preconditioner often show significant convergence degradation when the Jacobi preconditioner is computed in low precision such as FP16 or BF16. We find that this convergence degradation is attributable to a loss of diagonal dominance caused by roundoff errors in data conversion. To resolve this issue, we propose a new data conversion method designed to preserve the diagonal dominance of the original matrix data. The proposed method is tested by solving the Poisson equation using the conjugate gradient method, the generalized minimal residual method, and the biconjugate gradient stabilized method with the FP16/BF16 Jacobi preconditioner on NVIDIA V100 GPUs. The new data conversion is implemented by switching among the round-to-nearest, round-up, round-down, and round-towards-zero intrinsics in CUDA, and it is called once before the main iteration; its cost is therefore negligible. When the coefficients of the matrix are continuously changed by scaling the linear system, the conventional data conversion based on the round-to-nearest intrinsic shows periodic changes in the convergence property depending on the difference in roundoff errors between diagonal and off-diagonal coefficients, where the period and magnitude of the convergence degradation depend on the bit length of the significand. In contrast, the proposed data conversion method fully avoids the convergence degradation and enables robust mixed precision computing for the Jacobi preconditioner without extra overhead.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"84 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127976838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}