Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region — Latest Publications

Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library
Hiroyuki Ootomo, Rio Yokota
DOI: https://doi.org/10.1145/3578178.3578238 · Published 2023-02-27
Abstract: Matrix-matrix multiplication underlies many linear algebra algorithms, such as matrix decomposition and tensor contraction. The NVIDIA Tensor Core is a mixed-precision matrix-multiply-and-accumulate unit whose theoretical peak performance exceeds 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, because Tensor Core performance is so high, the Bytes-per-Flop (B/F) ratio between shared memory and Tensor Cores is small, so reducing the shared memory footprint is important for efficient Tensor Core usage. In this paper, we analyze simple matrix-matrix multiplication on Tensor Cores with the roofline model and show that shared memory bandwidth can limit performance when the WMMA API is used. To alleviate this issue, we provide a WMMA API extension library with two components that boost computational throughput. The first allows flexible manipulation of the register arrays fed to Tensor Cores; our evaluation shows that it reduces the shared memory footprint and speeds up computation on Tensor Cores. The second is an API for SGEMM emulation on Tensor Cores without additional shared memory usage. We demonstrate that a single-precision-emulating batched SGEMM implementation built on this library achieves 54.2 TFlop/s on an A100 GPU, exceeding the theoretical peak performance of the FP32 SIMT cores while matching the accuracy of cuBLAS. This throughput cannot be reached at the same register usage without the shared memory footprint reduction our library performs.
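The roofline argument in the abstract, that attainable throughput is the minimum of the compute peak and memory bandwidth times arithmetic intensity, can be sketched with a toy calculation. The peak and bandwidth figures below are illustrative placeholders, not numbers from the paper.

```python
# Toy roofline model: attainable throughput is capped either by the
# compute peak or by shared-memory bandwidth times arithmetic intensity.
def roofline_tflops(peak_tflops, bandwidth_tbps, flops_per_byte):
    """Attainable performance (TFlop/s) under the roofline model."""
    return min(peak_tflops, bandwidth_tbps * flops_per_byte)

# Illustrative placeholder figures, not measurements from the paper.
peak = 312.0   # hypothetical Tensor Core peak, TFlop/s
bw = 19.5      # hypothetical shared-memory bandwidth, TB/s

# Halving shared-memory traffic doubles arithmetic intensity (Flop/Byte),
# raising the bandwidth-bound ceiling until the compute peak is reached.
for intensity in (4.0, 8.0, 16.0):
    print(intensity, roofline_tflops(peak, bw, intensity))
```

Reducing the shared memory footprint is exactly the lever that moves a kernel rightward on such a plot, from the bandwidth-bound slope toward the compute roof.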
Citations: 2
A Case Study on DaCe Portability & Performance for Batched Discrete Fourier Transforms
Måns I. Andersson, S. Markidis
DOI: https://doi.org/10.1145/3578178.3578239 · Published 2023-02-27
Abstract: With the emergence of new computer architectures, portability and performance portability become significant concerns when developing HPC applications. This work reports our experience and lessons learned using DaCe to create and optimize batched Discrete Fourier Transform (DFT) calculations on different single-node computer systems. The batched DFT is an essential component of FFT algorithms and is widely used in computer science, numerical analysis, and signal processing. We implement the batched DFT with three complex-value array data layouts and compare them with the native complex-type implementation. DaCe relies on Stateful DataFlow multiGraphs (SDFG) as an intermediate representation (IR) that can be optimized through transformations before generating code for different architectures. We present several performance results showcasing the potential of DaCe for expressing HPC applications on different computer systems.
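As a minimal illustration of the layout comparison described above, the sketch below computes a batched DFT twice: once on native complex numbers and once on a split real/imaginary (structure-of-arrays) layout, representative of the kinds of data layout choices the paper evaluates. This is a naive O(n^2) reference in plain Python, not the DaCe implementation.

```python
import cmath
import math

def dft_batched(batch):
    """Naive batched DFT on native complex numbers (O(n^2) per signal)."""
    out = []
    for x in batch:
        n = len(x)
        out.append([sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                        for j in range(n)) for k in range(n)])
    return out

def dft_batched_split(re, im):
    """The same transform on a split real/imaginary layout."""
    out_re, out_im = [], []
    for xr, xi in zip(re, im):
        n = len(xr)
        yr, yi = [0.0] * n, [0.0] * n
        for k in range(n):
            for j in range(n):
                c = math.cos(2 * math.pi * k * j / n)
                s = math.sin(2 * math.pi * k * j / n)
                # (xr[j] + i*xi[j]) * (c - i*s), accumulated componentwise
                yr[k] += xr[j] * c + xi[j] * s
                yi[k] += xi[j] * c - xr[j] * s
        out_re.append(yr)
        out_im.append(yi)
    return out_re, out_im
```

Both variants produce the same numbers; the layouts differ only in memory access pattern, which is precisely what matters for performance portability across architectures.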
Citations: 2
ESSPER: Elastic and Scalable FPGA-Cluster System for High-Performance Reconfigurable Computing with Supercomputer Fugaku
K. Sano, Atsushi Koshiba, Takaaki Miyajima, Tomohiro Ueno
DOI: https://doi.org/10.1145/3578178.3579341 · Published 2023-02-27
Abstract: FPGA clusters have yet to become mainstream in HPC, even as accelerators, and several challenges remain in their architecture and system organization. This work presents ESSPER, a flexible and scalable FPGA cluster prototype system for reconfigurable HPC, designed for customizability, scalability, and interoperability with existing HPC systems. Based on our classification of FPGA cluster architectures, we propose a new category of FPGA clusters with a host-FPGA bridging network that uses software-bridged APIs to access remote FPGAs. We have designed, implemented, verified, and demonstrated a proof-of-concept ESSPER system as a functional extension of the supercomputer Fugaku.
Citations: 2
Parallelization of Automatic Tuning for Hyperparameter Optimization of Pedestrian Route Prediction Applications using Machine Learning
Sorataro Fujika, Yuga Yajima, Teruo Tanaka, A. Fujii, Yuka Kato, S. Ohshima, T. Katagiri
DOI: https://doi.org/10.1145/3578178.3578235 · Published 2023-02-27
Abstract: We study automatic software tuning. Automatic tuning tools based on iterative one-dimensional search estimate the hyperparameters of machine learning programs by repeatedly measuring and evaluating the target program over the space of candidate parameter values. Since training a machine learning program takes time, estimating optimal hyperparameters is time-consuming. We therefore propose a method that reduces automatic tuning time by parallelizing the iterative one-dimensional search. For parallelization, we use multiple job execution on a supercomputer with multiple GPUs, which is effective for machine learning; each job measures different hyperparameters, and the next search point is determined from the data obtained by all jobs. The target program is a pedestrian route prediction application that predicts future routes and arrival points from past pedestrian trajectory data. The program is intended for use in a variety of locations, and the locations and movement patterns vary with the dataset used for training. We hypothesized that the estimation results from one dataset could seed the automatic tuning of another dataset, further reducing tuning time. Experimental results confirm that the parallelized iterative one-dimensional search reduces estimation time from 89.5 hours to 4 hours compared with sequential search. We also show that the iterative one-dimensional search efficiently investigates the points at which the performance index improves. Moreover, when the hyperparameters estimated for one dataset are used as the initial search point for another dataset, both the number of executions and the execution time are reduced compared with tuning that starts from the currently used hyperparameters.
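A minimal sketch of a parallelized iterative one-dimensional search follows: it sweeps one hyperparameter at a time and evaluates all candidate values for that parameter concurrently, each evaluation standing in for a job on the cluster. The objective function, parameter names, and thread-based parallelism are illustrative stand-ins, not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_1d_search(objective, space, start, sweeps=2):
    """Iterative one-dimensional search with parallel candidate evaluation.

    `space` maps each hyperparameter to its candidate values.  For one
    parameter at a time, every candidate (with the other parameters held
    fixed) is evaluated concurrently, mimicking concurrent jobs on a
    supercomputer; the best candidate becomes the next search point.
    """
    best = dict(start)
    for _ in range(sweeps):
        for name, candidates in space.items():
            trials = [{**best, name: v} for v in candidates]
            with ThreadPoolExecutor() as pool:
                scores = list(pool.map(objective, trials))
            best = trials[min(range(len(trials)), key=scores.__getitem__)]
    return best

# Toy objective standing in for a training run's validation loss.
def loss(p):
    return (p["lr"] - 3) ** 2 + (p["layers"] - 5) ** 2

best = parallel_1d_search(loss, {"lr": [1, 2, 3, 4], "layers": [4, 5, 6]},
                          start={"lr": 1, "layers": 4})
print(best)  # {'lr': 3, 'layers': 5}
```

Passing a previously tuned configuration as `start` models the paper's idea of seeding the search for one dataset with the results estimated on another.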
Citations: 0
Comparison of Reproducible Parallel Preconditioned BiCGSTAB Algorithm Based on ExBLAS and ReproBLAS
X. Lei, Tongxiang Gu, S. Graillat, Xiaowen Xu, Jing Meng
DOI: https://doi.org/10.1145/3578178.3578234 · Published 2023-02-27
Abstract: Krylov subspace algorithms are important methods for solving linear systems. To solve large-scale linear systems efficiently, parallelization techniques are often applied. However, parallelism amplifies the non-associativity of floating-point operations, which can make the computations non-reproducible. This paper compares the performance of the parallel preconditioned BiCGSTAB algorithm implemented with two libraries (ExBLAS and ReproBLAS) that ensure reproducibility of the computations. To control for compiler effects, we explicitly use FMA instructions. Numerical experiments show that the BiCGSTAB algorithms based on both BLAS implementations are reproducible; the ExBLAS-based version is more accurate but more time-consuming than the ReproBLAS-based one.
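The reproducibility problem the paper addresses stems from floating-point addition being non-associative: a parallel reduction that changes summation order can change the rounded result. The sketch below shows the effect and how a correctly rounded sum (here Python's `math.fsum`, playing the role that exact reductions in ExBLAS play for BLAS operations) restores order-independence.

```python
import math

def naive_sum(xs):
    """Left-to-right accumulation; the result depends on operand order."""
    total = 0.0
    for x in xs:
        total += x
    return total

a = [1e16, 1.0, -1e16]
b = [1e16, -1e16, 1.0]   # same values, different order

print(naive_sum(a), naive_sum(b))    # 0.0 1.0  (the 1.0 is absorbed in a)
print(math.fsum(a), math.fsum(b))    # 1.0 1.0  (correctly rounded, order-independent)
```

A reduction whose result is correctly rounded is by construction independent of the order in which threads or nodes combine partial sums, which is exactly the guarantee a reproducible BiCGSTAB needs.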
Citations: 1
Efficient Large Integer Multiplication with Arm SVE Instructions
Takuya Edamatsu, D. Takahashi
DOI: https://doi.org/10.1145/3578178.3578193 · Published 2023-02-27
Abstract: In this study, we implement large integer multiplication with the Arm Scalable Vector Extension (SVE) instructions. SVE is a single instruction, multiple data (SIMD) instruction set for the Arm AArch64 architecture. We use a reduced-radix representation because SIMD instructions do not retain the carry that occurs when partial products are added during large integer multiplication. Furthermore, we develop and implement a multiplication algorithm based on the Basecase method, which allows ordinary multiplication instructions to be applied to integers in reduced-radix representation. To evaluate performance, we compare our implementation on an A64FX processor with the GNU Multiple Precision Arithmetic Library (GMP). Processing with SVE was faster than GMP for multiplications with operands larger than 2,048 bits, with a performance gain of up to 36%. These results suggest that SVE instructions have the potential to be faster than scalar instructions for large integer multiplication, especially for large operands.
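The reduced-radix idea can be sketched in a few lines: each limb stores fewer bits than its word (say 52 bits in a 64-bit word), leaving headroom so that partial products can be accumulated without propagating a carry after every addition, which is what lets SIMD units without a carry flag vectorize the inner loop. Python integers are unbounded, so this only models the arithmetic; the limb width and count are illustrative choices, not the paper's exact parameters.

```python
def to_limbs(n, radix_bits=52, count=4):
    """Split n into `count` limbs of `radix_bits` bits each (low limb first)."""
    mask = (1 << radix_bits) - 1
    return [(n >> (radix_bits * i)) & mask for i in range(count)]

def from_limbs(limbs, radix_bits=52):
    """Reassemble an integer from (possibly unnormalized) limbs."""
    return sum(l << (radix_bits * i) for i, l in enumerate(limbs))

def mul_reduced_radix(a, b, radix_bits=52):
    """Schoolbook (Basecase-style) multiplication with deferred carries.

    Partial products are accumulated into wide accumulators with no
    per-addition carry handling; a single carry-propagation pass runs
    at the end.
    """
    acc = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            acc[i + j] += ai * bj     # headroom absorbs the carries
    mask = (1 << radix_bits) - 1
    carry = 0
    for k in range(len(acc)):
        t = acc[k] + carry
        acc[k] = t & mask
        carry = t >> radix_bits
    return acc
```

For example, `from_limbs(mul_reduced_radix(to_limbs(x), to_limbs(y)))` reproduces `x * y` for operands that fit in the chosen limb count.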
Citations: 1
GPU–FPGA-accelerated Radiative Transfer Simulation with Inter-FPGA Communication
Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, Makito Abe, M. Umemura
DOI: https://doi.org/10.1145/3578178.3578231 · Published 2023-02-27
Abstract: The complementary use of graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) is a major topic of interest in the high-performance computing (HPC) field. GPU–FPGA-accelerated computing is an effective tool for multiphysics simulations, which encompass multiple physical models and simultaneous physical phenomena. Because the constituent operations in multiphysics simulations exhibit varying characteristics, accelerating them solely with GPUs is often challenging, and FPGAs are frequently employed as a complement. The objective of the present study is to further improve application performance by employing GPUs and FPGAs in a complementary manner. This approach has recently been applied to ARGOT, a radiative transfer simulation code for astrophysics, with evaluation results quantitatively demonstrating the resulting performance improvement; however, those results came from a single node equipped with one GPU and one FPGA. In this study, we extend the GPU–FPGA-accelerated ARGOT code to run on multiple nodes using the Message Passing Interface (MPI) and an FPGA-to-FPGA communication scheme called the Communication Integrated Reconfigurable CompUting System (CIRCUS). We evaluate the ARGOT code with multiple GPUs and FPGAs under weak scaling conditions and find that it achieves up to 12.8x speedup compared with GPU-only execution.
Citations: 1
Effectiveness of the Oversubscribing Scheduling on Supercomputer Systems
Shohei Minami, Toshio Endo, Akihiro Nomura
DOI: https://doi.org/10.1145/3578178.3578221 · Published 2023-02-27
Abstract: High responsiveness is essential to user satisfaction on supercomputer systems. Interactive jobs are attracting attention alongside traditional batch jobs, and handling both kinds of jobs in a consolidated manner is increasingly important for responsive systems. We show that oversubscribing scheduling, in which multiple HPC jobs share computational resources, can process jobs effectively. This paper builds a job scheduling simulator that models oversubscription and evaluates the oversubscribed system using workload trace data from an actual supercomputer. While keeping users' response times short, our solution offers strengths a conventional solution lacks: it benefits normal jobs, alleviates slowdown, and removes the need for laborious system configuration tuning.
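A toy model of the trade-off can be written in a few lines: under exclusive first-come-first-served scheduling, a short interactive job queued behind a long batch job waits for the entire batch run, while under oversubscription it starts immediately at the cost of a slowdown factor. The uniform-slowdown model below is a deliberately crude assumption for illustration, not the simulator described in the paper.

```python
def fcfs_response_times(jobs):
    """Exclusive FCFS on one node; `jobs` is (arrival, runtime) in arrival order."""
    clock, resp = 0.0, []
    for arrival, runtime in jobs:
        start = max(clock, arrival)
        clock = start + runtime
        resp.append(clock - arrival)   # response time = finish - arrival
    return resp

def oversubscribed_response_times(jobs, slowdown=2.0):
    """Crude oversubscription model: every job starts immediately but runs
    `slowdown` times slower because it shares the node with others."""
    return [runtime * slowdown for _, runtime in jobs]

jobs = [(0.0, 100.0),   # long batch job
        (1.0, 0.1)]     # short interactive job arriving just after it
print(fcfs_response_times(jobs))           # the interactive job waits ~99 units
print(oversubscribed_response_times(jobs)) # it finishes in 0.2 units instead
```

Even this crude model shows why consolidating interactive and batch jobs via oversubscription can keep response times short without dedicating nodes to interactive use.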
Citations: 0
Fault-Tolerant LOBPCG for Nuclear CI Calculations
Meiyue Shao, Dossay Oryspayev, Chao Yang, Pieter Maris, B. Cook
DOI: https://doi.org/10.1145/3578178.3578240 · Published 2023-02-27
Abstract: Exascale computing platforms with millions of compute units across thousands of nodes are predicted to experience frequent faults that interrupt application execution, making resilience against faults important. We examine user- and software-level fault mitigation strategies in a distributed LOBPCG algorithm targeting nuclear CI calculations. In particular, we present and evaluate a strategy that keeps the total number of fault-tolerant LOBPCG iterations close to that of the standard LOBPCG algorithm run on a fault-free machine.
Citations: 1
A new data conversion method for mixed precision Krylov solvers with FP16/BF16 Jacobi preconditioners
Takuya Ina, Y. Idomura, Toshiyuki Imamura, Naoyuki Onodera
DOI: https://doi.org/10.1145/3578178.3578222 · Published 2023-02-27
Abstract: Mixed precision Krylov solvers with the Jacobi preconditioner often show significant convergence degradation when the preconditioner is computed in low precision such as FP16 or BF16. We find that this degradation is caused by loss of diagonal dominance due to roundoff errors in data conversion. To resolve this issue, we propose a new data conversion method designed to preserve the diagonal dominance of the original matrix data. The proposed method is tested by solving the Poisson equation with the conjugate gradient method, the generalized minimal residual method, and the biconjugate gradient stabilized method using an FP16/BF16 Jacobi preconditioner on NVIDIA V100 GPUs. The new conversion is implemented by switching among the round-nearest, round-up, round-down, and round-towards-zero intrinsics in CUDA and is called once before the main iteration, so its cost is negligible. When the matrix coefficients are varied continuously by scaling the linear system, conventional conversion based on the round-nearest intrinsic shows periodic changes in convergence behavior depending on the difference in roundoff errors between diagonal and off-diagonal coefficients; the period and magnitude of the degradation depend on the bit length of the significand. The proposed data conversion method, by contrast, fully avoids the convergence degradation, enabling robust mixed precision computing with the Jacobi preconditioner at no extra overhead.
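The key idea, preserving diagonal dominance by choosing the rounding direction per entry, can be sketched outside CUDA: round diagonal entries away from zero and off-diagonal entries toward zero, so that each |a_ii| can only grow while each off-diagonal row sum can only shrink. The helper below truncates the significand to a given bit count to emulate FP16/BF16-like precision; it illustrates the principle and is not the authors' CUDA implementation.

```python
import math

def round_significand(x, bits, mode):
    """Round |x| to `bits` significand bits, toward zero ("down") or away
    from zero ("up"); a stand-in for CUDA's directed-rounding conversion
    intrinsics."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))            # |x| = m * 2**e with 0.5 <= m < 1
    scaled = m * (1 << bits)
    scaled = math.floor(scaled) if mode == "down" else math.ceil(scaled)
    return math.copysign(math.ldexp(scaled, e - bits), x)

def convert_preserving_dominance(matrix, bits=8):
    """Low-precision conversion that keeps diagonal dominance: diagonal
    entries are rounded away from zero, off-diagonals toward zero."""
    n = len(matrix)
    return [[round_significand(matrix[i][j], bits,
                               "up" if i == j else "down")
             for j in range(n)] for i in range(n)]
```

With round-to-nearest, a diagonal entry can round down while an off-diagonal entry rounds up, which is how dominance is lost; the directed variant rules out both by construction, at the cost of at most one unit in the last place per entry.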
Citations: 0