Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region: Latest Articles

Multiplicative Schwartz-Type Block Multi-Color Gauss-Seidel Smoother for Algebraic Multigrid Methods
Masatoshi Kawai, Akihiro Ida, Hiroya Matsuba, K. Nakajima, M. Bolten
Abstract: In this paper, we propose a multiplicative Schwartz-type block multi-color Gauss-Seidel (MS-BMC-GS) smoother for algebraic multigrid (AMG) methods. AMG is an excellent solver and one of the most effective preconditioners for Krylov subspace methods such as the conjugate gradient method. The achievable degree of parallelism, convergence ratio, and computational cost of AMG strongly depend on the chosen smoother. As multiple unknowns are relaxed simultaneously, the MS-BMC-GS smoother realizes higher convergence than the existing parallel Gauss-Seidel smoother. Although this increases the amount of computation, the increase in computational time is mitigated by the high cache hit ratio owing to the novel blocking technique. Numerical experiments demonstrate that MS-BMC-GS outperforms the block multi-color GS smoother by 18%.
DOI: https://doi.org/10.1145/3368474.3368481
Published: 2020-01-15
Citations: 0
Energy Efficient Runahead Execution on a Tightly Coupled Heterogeneous Core
Susumu Mashimo, Ryota Shioya, Koji Inoue
Abstract: Out-of-order (OoO) processors generally offer significant performance gains over simpler in-order (InO) processors. However, recent studies have revealed that OoO processors provide little performance benefit in many program phases, and these phases are distributed in fine granularity. Leveraging these fine-grained phases, tightly coupled heterogeneous cores (TCHCs) have been proposed to improve energy efficiency. A TCHC is a processor core that consists of multiple back-ends, each with different performance and energy-consumption characteristics (e.g., a power-efficient InO back-end and a high-performance OoO back-end). It improves energy efficiency by switching execution to the most energy-efficient back-end with a very small switching penalty. We propose a novel technique to further improve the energy efficiency of a TCHC. The proposed technique is based on runahead execution (RAE), which is a prefetch technique that executes instructions ahead of long-latency cache misses and issues independent cache misses earlier. Leveraging the characteristics of TCHCs and RAE, the proposed technique increases the utilization of energy-efficient back-ends, thereby significantly improving the energy efficiency. Our evaluation results show that our proposed method achieves 13% of energy-delay product (EDP) over a state-of-the-art TCHC using Oracle switching decision logic.
DOI: https://doi.org/10.1145/3368474.3368496
Published: 2020-01-15
Citations: 0
Accuracy Improvement of Memory System Simulation for Modern Shared Memory Processor
Yuetsu Kodama, Tetsuya Odajima, A. Asato, M. Sato
Abstract: To support early-stage application development for the supercomputer Fugaku, RIKEN has developed a processor simulator. This simulator is based on the general-purpose processor simulator gem5. It does not simulate the actual hardware of a Fugaku processor. However, we believe that sufficient simulation accuracy can be obtained, since it simulates the instruction pipeline of out-of-order execution with cycle-level accuracy and the out-of-order resources are tuned with detailed parameters. To estimate the execution time of a program accurately, it is necessary to simulate accurately not only the instruction execution time but also the access time of the cache memory hierarchy. Therefore, in the RIKEN simulator, we extended gem5 to match the performance of the cache memory hierarchy to that of a Fugaku processor. With this simulator, we aim to estimate the execution cycles of a single-node application on a Fugaku processor with an accuracy that enables relative evaluation and application tuning. In this paper, we show the details of the implementation of this simulator and verify its accuracy against a Fugaku processor test chip. In an evaluation of 46 kernel benchmarks in total, the difference was confirmed to be 13% or less for 85% of the kernels. In the multithreaded execution of the Stream Triad benchmark, scalable performance according to the number of threads was confirmed, and over 80% of memory throughput was achieved with sufficient accuracy.
DOI: https://doi.org/10.1145/3368474.3368483
Published: 2020-01-15
Citations: 6
Dual-Plane Isomorphic Hypercube Network
T. Hosomi, Ryota Yasudo, M. Koibuchi, S. Shimojo
Abstract: We propose a multi-plane isomorphic network that increases network throughput and reduces network latency by effectively configuring multi-plane networks. In the proposed network, each plane adopts the same graph topology but different switch-to-switch connections. We evaluate the dual-plane isomorphic hypercube network by graph analysis and cycle-level simulation. Results of the graph analysis show that the dual-plane isomorphic 8-hypercube reduces the average shortest path length by 22% and improves throughput by 28% compared with the dual-plane hypercube. Similar improvements are confirmed by the results of the cycle-level simulation. We also examine the dual-plane isomorphic folded-hypercube network. Finally, we discuss the effect of the longer cable length caused by the isomorphic network on network cost and latency.
DOI: https://doi.org/10.1145/3368474.3368493
Published: 2020-01-15
Citations: 1
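As a rough illustration of the graph-analysis metric named in the abstract above, the following sketch computes the average shortest path length of a plain n-dimensional hypercube by breadth-first search. It models only a single standard hypercube, not the dual-plane isomorphic network the paper proposes, and the function name `hypercube_avg_shortest_path` is a hypothetical choice made for this example.

```python
from collections import deque


def hypercube_avg_shortest_path(n: int) -> float:
    """Average shortest path length over all ordered pairs of distinct
    nodes in a standard n-dimensional hypercube (baseline only; this is
    an illustrative assumption, not the paper's network)."""
    num_nodes = 1 << n
    total = 0
    for src in range(num_nodes):
        # Breadth-first search from src; neighbours differ in exactly one bit.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for bit in range(n):
                v = u ^ (1 << bit)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    # Every ordered pair (src, dst) with src != dst is counted once.
    return total / (num_nodes * (num_nodes - 1))
```

For a plain hypercube the result should agree with the closed form n * 2^(n-1) / (2^n - 1), which gives a convenient sanity check before comparing against any modified topology.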
Exploiting Spark for HPC Simulation Data: Taming the Ephemeral Data Explosion
M. Jiang, Brian Gallagher, Albert Chu, G. Abdulla, Timothy Bender
Abstract: In this paper, we address the challenge of analyzing simulation data on HPC systems by using Apache Spark, which is a Big Data framework. One of the main problems we encountered with using Spark on HPC systems is the ephemeral data explosion, which is brought about by the curse of persistence in the Spark framework. Data persistence is essential in reducing I/O, but it comes at the cost of storage space. We show that in some cases, Spark scratch data can consume an order of magnitude more space than the input data being analyzed, leading to fatal out-of-disk errors. We investigate the real-world application of scaling machine learning algorithms to predict and analyze failures in multi-physics simulations on 76TB of data (over one trillion training examples). This problem is 2-3 orders of magnitude larger than prior work. Based on extensive experiments at scale, we provide several concrete recommendations as state-of-the-practice, and demonstrate a 7x reduction in disk utilization with negligible increases or even decreases in runtime.
DOI: https://doi.org/10.1145/3368474.3368482
Published: 2020-01-15
Citations: 2
Diamond matrix powers kernels
Emil Vatai, U. Singhal, R. Suda
Abstract: Matrix powers kernels calculate the vectors A^k v for k = 1, 2, ..., m, and they are at the heart of various scientific computations, including communication-avoiding iterative solvers. In this paper we propose the diamond matrix powers kernel (DMPK), whose purpose is to apply the "diamond tiling" stencil algorithm to general matrices. It can also be considered an extension of the PA1 and PA2 algorithms introduced by Demmel et al. Our approach enables us to control the balance between the amount of communication avoidance and the redundant computation inherently present in communication-avoiding algorithms. We present a proof-of-concept implementation of the algorithm using MPI routines. The experiments we performed show that control of the amount of computation and communication is achievable, and that with more thorough optimisations, DMPK is a promising alternative to existing MPK approaches.
DOI: https://doi.org/10.1145/3368474.3368494
Published: 2020-01-15
Citations: 1
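To make the quantity computed by a matrix powers kernel concrete, here is a minimal sequential sketch that produces the vectors A^k v for k = 1, ..., m by repeated matrix-vector multiplication. It is a naive baseline with no tiling, partitioning, or communication avoidance, so it is not DMPK, PA1, or PA2; the names `matvec` and `matrix_powers` are hypothetical.

```python
def matvec(A, v):
    """Dense matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matrix_powers(A, v, m):
    """Return [A^1 v, A^2 v, ..., A^m v] by repeated multiplication.

    Naive baseline only: each step reads the whole matrix again, which is
    the cost that communication-avoiding kernels try to reduce.
    """
    result = []
    current = list(v)
    for _ in range(m):
        current = matvec(A, current)
        result.append(current)
    return result
```

A call such as `matrix_powers([[1, 1], [0, 1]], [0, 1], 2)` yields the two vectors A v and A^2 v in order.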
Effect of Mixed Precision Computing on H-Matrix Vector Multiplication in BEM Analysis
R. Ooi, T. Iwashita, Takeshi Fukaya, Akihiro Ida, Rio Yokota
Abstract: The hierarchical matrix (H-matrix) is an approximation technique that splits a target dense matrix into multiple submatrices, a selected portion of which are low-rank approximated. The technique substantially reduces both the time and space complexity of dense matrix-vector multiplication, and hence has been applied to numerous practical problems. In this paper, we aim to accelerate H-matrix vector multiplication by introducing mixed precision computing, where we employ both binary64 (FP64) and binary32 (FP32) arithmetic operations. We propose three methods to introduce mixed precision computing to H-matrix vector multiplication, and then evaluate them in a boundary element method (BEM) analysis. The numerical tests examine the effects of mixed precision computing, particularly on the required simulation time and the rate of convergence of the iterative (BiCG-STAB) linear solver. We confirm the effectiveness of the proposed methods.
DOI: https://doi.org/10.1145/3368474.3368479
Published: 2019-10-30
Citations: 4
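The effect of mixing binary32 and binary64 can be seen in a very small example. The sketch below rounds one operand of a dot product to binary32 while accumulating in binary64 and compares the result with a pure binary64 dot product. It is only an illustration of the rounding involved and does not reproduce any of the three methods proposed in the paper; `to_fp32`, `dot_fp64`, and `dot_fp32_storage` are hypothetical names.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_fp64(a, b):
    """Dot product with both operands kept in binary64."""
    return sum(x * y for x, y in zip(a, b))


def dot_fp32_storage(a, b):
    """Dot product with `a` stored in binary32 and accumulation in binary64.

    This mimics keeping part of the data in lower precision; it is an
    assumption made for illustration, not the paper's scheme.
    """
    return sum(to_fp32(x) * y for x, y in zip(a, b))
```

Values that are exactly representable in binary32, such as 0.5, pass through `to_fp32` unchanged, while values such as 0.1 pick up a small rounding error that then propagates into the product.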
Towards Real Time Multi-robot Routing using Quantum Computing Technologies
James Clark, Tristan West, Joseph Zammit, X. Guo, Luke Mason, Duncan Russell
Abstract: In this paper, we investigate the potential for current quantum computing technologies to provide good solutions to the NP-hard problem of routing multiple robots on a grid in real time. A hybrid quantum-classical approach is presented in detail. Classical computation is used to generate candidate paths, while quantum annealing is used to select the optimal combination of paths. This second process is generally the most time-consuming when performed classically. The performance is benchmarked classically and on a D-Wave 2000Q with up to 200 robots, and the results show that producing valid solutions for the problem of multi-robot routing is achievable with current quantum annealing technology. The current limitations of using quantum annealing are also discussed.
DOI: https://doi.org/10.1145/3293320.3293333
Published: 2019-01-14
Citations: 15
Acceleration of Symmetric Sparse Matrix-Vector Product using Improved Hierarchical Diagonal Blocking Format
Ryo Muro, A. Fujii, Teruo Tanaka
Abstract: In a previous study, Guy et al. proposed accelerating the sparse matrix-vector product (SpMV) using the Hierarchical Diagonal Blocking (HDB) format, which recursively repeats partitioning, reordering, and blocking on a symmetric sparse matrix. The HDB format stores a sparse matrix hierarchically using a tree structure. Each node of the tree stores a small sparse matrix in CSR format. In the present study, we examined two problems with the HDB format and provided a solution for each. First, SpMV using the HDB format has partial dependencies among hierarchy levels, so the parallelism of the computation decreases as the nodes get closer to the root. We therefore propose cutting these dependencies using work vectors. Second, each node of the conventional HDB format is stored in Compressed Sparse Row (CSR) format, whereas Block Compressed Sparse Row (BSR) format is often faster than CSR format in SpMV performance. We therefore also evaluated the effectiveness of our proposed method with work vectors for a BSR-HDB format. In addition, we compare the performance of the general formats (CSR format, BSR format) using the Intel Math Kernel Library (MKL), the conventional HDB format, and the expanded HDB format on 22 types of sparse matrix from various fields. The results showed that the SpMV performance was highest with our expanded HDB format for 19 types of sparse matrix, and it was 1.99 times faster than the CSR format.
DOI: https://doi.org/10.1145/3293320.3293332
Published: 2019-01-14
Citations: 3
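For readers unfamiliar with the CSR baseline that this abstract compares against, the sketch below shows a plain sequential SpMV over the three standard CSR arrays. It covers only the generic CSR format, not the HDB or BSR-HDB formats from the paper, and the function and parameter names (`csr_spmv`, `values`, `col_idx`, `row_ptr`) are hypothetical.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in Compressed Sparse Row form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns entries row_ptr[i] .. row_ptr[i + 1] - 1
    """
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

As a usage example, the symmetric matrix with rows (4, 1, 0), (1, 3, 0), (0, 0, 2) is stored as `values = [4, 1, 1, 3, 2]`, `col_idx = [0, 1, 0, 1, 2]`, `row_ptr = [0, 2, 4, 5]`. Each row is independent in this layout, which is why the row loop is the usual place to parallelise.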
Comparative benchmarking of HPC systems for GSS applications: GSS applications in the HPC ecosystem
D. Kaliszan, S. Fürst, M. Gienger, Sergiy Gogolenko, N. Meyer, S. Petruczynik
Abstract: The work presented in this paper was carried out in the Centre of Excellence for Global Systems Science (CoeGSS), an interdisciplinary project funded by the European Commission. The project provides decision support in the face of global challenges and brings together HPC and global systems science (GSS). This paper proposes a GSS benchmark with the aim of finding the most suitable HPC architecture and the best HPC system for running GSS applications effectively. GSS provides evidence about global systems challenges, e.g. the network structure of the world economy, energy, water and food supply systems, the global financial system or the global city system, and the scientific community. The outcome of the analysis is the definition of a benchmark that represents the GSS environment as well as possible. Three exemplary challenges were defined as pilot applications: Health Habits, Green Growth and Global Urbanisation. They were extended with additional applications from the GSS ecosystem: iterative proportional fitting (IPF); data rastering, a preprocessing step that converts all vectorial representations of georeferenced data into raster files to be used later as simulation input; the Weather Research and Forecasting (WRF) model; CMAQ/CCTM (Community Multiscale Air Quality Modelling System / CMAQ Chemistry-Transport Model); CM1 (Cloud Modelling); ABMS (Agent-based Modelling and Simulation); and OpenSWPC (an Open-source Seismic Wave Propagation Code). This list is fairly rich and reflects the real GSS world as closely as possible, taking into account, for example, the availability of real-world applications. Additionally, the authors tested new HPC platforms based on Intel® Xeon® Gold 6140, AMD Epyc™, ARM Hi1616 and IBM Power8+. Due to hardware availability, the testbed consisted of a limited number of nodes, which restricted the ability to provide full scalability tests for the given applications. However, this small number of available computational units (cores) can still provide valuable results, including an architecture comparison for different applications based on execution times, TDPs and TCO. These are the basic metrics used to rank HPC architectures. Finally, this document is intended as valuable information for the GSS community for future analysis to determine their specific demands and, more generally, to help develop a mature final benchmark set reflecting the requirements and specifics of the GSS environment. As none of the existing benchmarks is dedicated to the GSS community, the authors decided to create one, called the GSS benchmark, to serve and help GSS users in their future work.
DOI: https://doi.org/10.1145/3293320.3293326
Published: 2019-01-14
Citations: 1