Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region: Latest Publications

Fault Tolerance for Ensemble-based Molecular-Continuum Flow Simulations
Vahid Jafari, Philipp Neumann
DOI: 10.1145/3578178.3578220 (published 2023-02-27)
Abstract: Molecular dynamics (MD) simulations are computationally expensive and therefore very time-consuming. This particularly holds for molecular-continuum simulations in fluid dynamics, which rely on the simulation of MD ensembles coupled to computational fluid dynamics (CFD) solvers. Massively parallel implementations of MD simulations and the respective ensembles are therefore of utmost importance. However, the more processors are used for the molecular-continuum simulation, the higher the probability that software- or hardware-induced failures or malfunctions of a single processor cause the entire simulation to crash. To avoid long re-calculation times, a fault tolerance mechanism is required, especially for simulations carried out at the exascale. In this paper, we introduce a fault tolerance method for molecular-continuum simulations implemented in the macro-micro-coupling tool (MaMiCo), an open-source coupling tool for such multiscale simulations that allows the re-use of one's favorite MD and CFD solvers. The method makes use of a dynamic ensemble handling approach previously used to estimate statistical errors due to thermal fluctuations in the MD ensemble. The dynamic ensemble is always homogeneously distributed, and thus balanced, across the computational resources to minimize the overall induced overhead. The method further relies on an MPI implementation with fault tolerance support. We report scalability results with and without modeled system failures on three TOP500 supercomputers (Fugaku/RIKEN with ARM technology, Hawk/HLRS with AMD EPYC technology, and HSUper/Helmut Schmidt University with Intel Icelake processors) to demonstrate the feasibility of our approach.
Citations: 0
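The dynamic ensemble handling described above can be pictured as redistributing ensemble members homogeneously over the surviving ranks after a failure. A minimal sketch of that balancing step (the function and data layout are hypothetical, not MaMiCo's actual API):

```python
def redistribute(members, ranks):
    """Assign ensemble members round-robin over the surviving ranks,
    keeping the load as homogeneous as possible (hypothetical sketch,
    not the actual MaMiCo interface)."""
    assignment = {r: [] for r in ranks}
    for i, m in enumerate(members):
        assignment[ranks[i % len(ranks)]].append(m)
    return assignment

# Simulate losing rank 2 out of ranks 0..3 with 8 ensemble members:
survivors = [0, 1, 3]
plan = redistribute(list(range(8)), survivors)
# Each surviving rank now holds 2 or 3 members (maximum imbalance of 1).
```

The round-robin choice keeps the per-rank member count within one of the average, which is what "homogeneously distributed" requires of any such scheme.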
LibCOS: Enabling Converged HPC and Cloud Data Stores with MPI
Daniel Araújo De Medeiros, S. Markidis, Ivy Bo Peng
DOI: 10.1145/3578178.3578236 (published 2023-02-27)
Abstract: Federated HPC and cloud resources are becoming increasingly strategic for providing diversified and geographically available computing resources. However, accessing data stores across HPC and cloud storage systems is challenging. Many cloud providers use object storage systems to support their clients in storing and retrieving data over the internet. One popular method is REST APIs atop the HTTP protocol, with Amazon's S3 API being supported by most vendors. In contrast, HPC systems are contained within their own networks and tend to use parallel file systems with POSIX-like interfaces. This work addresses the challenge of diverse data stores on HPC and cloud systems by providing native object storage support through the unified MPI I/O interface in HPC applications. In particular, we provide a prototype library called LibCOS that transparently enables MPI applications running on HPC systems to access object storage on remote cloud systems. We evaluated LibCOS on a Ceph object storage system and a traditional HPC system. In addition, we conducted a performance characterization of the core S3 operations that enable individual and collective MPI I/O. Our evaluation with HACC, IOR, and BigSort shows that bridging diverse data stores on HPC and cloud storage is feasible and can be achieved transparently through the widely adopted MPI I/O interface. We also show that a native object storage system like Ceph can improve the scalability of I/O operations in parallel applications.
Citations: 1
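One way to picture MPI I/O over object storage is mapping a contiguous file read onto ranged GETs across fixed-size objects. A minimal sketch, assuming a hypothetical "file.partN" key scheme rather than LibCOS's actual data layout:

```python
def byte_range_to_objects(offset, length, object_size):
    """Map a contiguous MPI-style file read (offset, length) onto ranged
    GETs over fixed-size objects named 'file.part<N>' (hypothetical key
    scheme, not LibCOS's actual layout). Returns (key, first, last)
    byte ranges, matching the inclusive semantics of an HTTP Range
    header."""
    requests = []
    end = offset + length
    while offset < end:
        obj = offset // object_size
        start = offset % object_size
        take = min(object_size - start, end - offset)
        requests.append((f"file.part{obj}", start, start + take - 1))
        offset += take
    return requests

# A 10-byte read at offset 12 over 8-byte objects spans two objects:
print(byte_range_to_objects(12, 10, 8))
# → [('file.part1', 4, 7), ('file.part2', 0, 5)]
```

Collective MPI I/O would then aggregate the per-rank request lists before issuing them, which is one place an object-storage backend can batch work.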
Parallelization of All-Pairs-Shortest-Path Algorithms in Unweighted Graph
M. Nakao, H. Murai, M. Sato
DOI: 10.1145/3368474.3368478 (published 2020-01-15)
Abstract: The design of the network topology of a large-scale parallel computer system can be represented as an order/degree problem in graph theory. Solving the order/degree problem requires computing the all-pairs shortest paths (APSP) of the graph. This paper therefore evaluates and compares two parallel algorithms that quickly find the APSP in unweighted graphs. The first algorithm is based on breadth-first search (BFS-APSP) and the second on the adjacency matrix (ADJ-APSP). First, we develop serial algorithms and threaded algorithms using OpenMP, and show that ADJ-APSP is up to 32.34 times faster than BFS-APSP. Next, we develop hybrid-parallel algorithms using OpenMP and MPI, and show that BFS-APSP is faster than ADJ-APSP under certain conditions because BFS-APSP admits a greater maximum number of processes. In addition, we parallelize ADJ-APSP on a single GPU (NVIDIA Tesla V100) and achieve a speedup of up to 16.53-fold over a single CPU. Finally, we evaluate the performance of the algorithms on 128 GPUs and achieve a computation time 101.10 times faster than on a single GPU. Moreover, we show that the calculation time of both algorithms can be greatly reduced when the input graphs are symmetric.
Citations: 4
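The ADJ-APSP idea, distances in an unweighted graph recovered from repeated boolean adjacency-matrix products, can be sketched as follows (a naive O(n^4) illustration of the principle, not the paper's optimized bitwise kernel):

```python
def adj_apsp(adj):
    """All-pairs shortest paths in an unweighted graph by repeatedly
    'multiplying' the boolean adjacency matrix: after k products, a set
    bit at (i, j) means a path of length <= k exists. This is the idea
    behind ADJ-APSP; real implementations pack rows into machine words."""
    n = len(adj)
    INF = float("inf")
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    reach = [row[:] for row in adj]          # pairs reachable in <= step edges
    for step in range(1, n):
        for i in range(n):
            for j in range(n):
                if reach[i][j] and dist[i][j] == INF:
                    dist[i][j] = step        # first time reached: distance found
        # boolean matrix product: extend every known path by one edge
        reach = [[int(any(reach[i][k] and adj[k][j] for k in range(n)))
                  for j in range(n)] for i in range(n)]
    return dist

# Path graph 0-1-2: distance from 0 to 2 is 2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(adj_apsp(A)[0])   # → [0, 1, 2]
```

BFS-APSP instead runs one BFS per source vertex, which parallelizes over more processes, matching the paper's observation about the two algorithms' scaling limits.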
Extended Hoeffding Adaptive Tree based-Server Load Prediction in Cloud Computing environment
Hajer Toumi, Zaki Brahmi, M. Gammoudi
DOI: 10.1145/3368474.3368475 (published 2020-01-15)
Abstract: Cloud Computing (CC) enables a client-server relationship that releases users from computational and storage responsibilities. As a multi-tenant environment, cloud providers deal, on the one hand, with multiple concurrent users, each exhibiting a different and variable behavior over time, and on the other hand, with performance interference due to the co-location of multiple virtual machines (VMs) on the same server. Therefore, real-time server load prediction is needed to ensure efficient resource provisioning. While classical data-mining techniques suffer from long evaluation times and are unable to react to changes as they arrive, stream-mining techniques can provide real-time prediction and change detection. In this paper, we therefore use a well-known stream-mining technique, the Hoeffding Adaptive Tree (HAT), to provide real-time server load prediction. The aim of our proposed technique is to detect, and react on the fly to, the different kinds of changes that can affect the server load. We therefore augment HAT with ensemble drift detectors to produce more accurate predictions. To evaluate our proposed technique, HAT-ADS, we first compare it with a well-known load prediction technique based on a Bayesian approach, and then with other HAT-based techniques. Overall, the experiments show that HAT-ADS is highly flexible with respect to various types of change, providing high accuracy with quick evaluation time and a small memory footprint.
Citations: 1
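The kind of drift detector that can be ensembled with HAT can be illustrated with a toy two-window mean comparison (illustrative only; the thresholded window pair below is an assumption, not the paper's HAT-ADS algorithm):

```python
from collections import deque

class WindowDriftDetector:
    """Toy drift detector: compare the mean of a short recent window
    against a longer reference window and flag drift when they diverge
    by more than a threshold. Real detectors (e.g. ADWIN) choose the
    windows and threshold with statistical guarantees."""
    def __init__(self, short=10, long=50, threshold=0.5):
        self.short = deque(maxlen=short)
        self.long = deque(maxlen=long)
        self.threshold = threshold

    def update(self, x):
        self.short.append(x)
        self.long.append(x)
        if len(self.long) < self.long.maxlen:
            return False                      # not enough history yet
        mean_s = sum(self.short) / len(self.short)
        mean_l = sum(self.long) / len(self.long)
        return abs(mean_s - mean_l) > self.threshold

# A load signal that jumps from 0.0 to 2.0 at step 60:
det = WindowDriftDetector()
drift = [det.update(0.0 if i < 60 else 2.0) for i in range(80)]
# Drift is flagged a few steps after the shift, once the short window fills.
```

An ensemble of such detectors with different window lengths trades detection delay against false alarms, which is the role the ensemble plays in the paper.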
Enhancing a manycore-oriented compressed cache for GPGPU
Keitarou Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue
DOI: 10.1145/3368474.3368491 (published 2020-01-15)
Abstract: GPUs achieve high performance by exploiting massive thread parallelism. However, some factors limit performance on GPUs, one of which is the negative effect of L1 cache misses. In some applications, GPUs are likely to suffer from L1 cache conflicts because a large number of cores share a small L1 cache capacity. A cache architecture based on data compression is a strong candidate for solving this problem, as it can reduce the number of cache misses. Unlike previous studies, our data compression scheme exploits value locality not only within cache lines but also across cache lines. We enhance the structure of a last-level compression cache proposed for general-purpose manycore processors, optimizing it for the shared L1 caches on GPUs. The experimental results reveal that our proposal outperforms the other compression cache for GPUs by 11 points on average.
Citations: 0
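Inter-line value locality can be illustrated with a base-delta scheme that stores one base line and small per-word deltas for neighboring lines (an illustrative sketch of the compression principle, not the paper's cache organization):

```python
def compress_lines(lines):
    """Base-delta sketch exploiting inter-line value locality: store the
    first cache line as a base and every later line as per-word deltas.
    When neighboring lines hold similar values, the deltas are small and
    fit in fewer bits than the original words."""
    base = lines[0]
    deltas = [[w - b for w, b in zip(line, base)] for line in lines[1:]]
    return base, deltas

def decompress_lines(base, deltas):
    """Exact inverse of compress_lines."""
    return [base] + [[b + d for b, d in zip(base, delta)] for delta in deltas]

# Three neighboring lines with similar word values compress to tiny deltas:
lines = [[100, 200, 300], [101, 201, 301], [99, 199, 299]]
base, deltas = compress_lines(lines)
print(deltas)   # → [[1, 1, 1], [-1, -1, -1]]
```

Intra-line schemes apply the same idea within one line; the scheme above shows why extending the base across lines can capture locality that per-line compression misses.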
Rethinking the Value of Asynchronous Solvers for Distributed Deep Learning
Arissa Wongpanich, Yang You, J. Demmel
DOI: 10.1145/3368474.3368498 (published 2020-01-15)
Abstract: In recent years, the field of machine learning has seen significant advances as data becomes more abundant and deep learning models become larger and more complex. However, these improvements in accuracy [2] have come at the cost of longer training time. As a result, state-of-the-art models like OpenAI's GPT-2 [18] or AlphaZero [20] require distributed systems or clusters to speed up training. Both asynchronous and synchronous solvers currently exist for distributed training. In this paper, we implement state-of-the-art asynchronous and synchronous solvers and compare them to help readers pick the most appropriate solver for their own applications. We address three main challenges: (1) implementing asynchronous solvers that can outperform six common algorithm variants, (2) achieving state-of-the-art distributed performance for various applications with different computational patterns, and (3) maintaining accuracy for large-batch asynchronous training. For asynchronous algorithms, we implement an algorithm called EA-wild, which combines the idea of non-locking wild updates from Hogwild! [19] with EASGD. Our implementation scales to 217,600 cores and finishes 90 epochs of training the ResNet-50 model on ImageNet in 15 minutes (the baseline takes 29 hours on eight NVIDIA P100 GPUs). We conclude that more complex models (e.g., ResNet-50) favor synchronous methods, while our asynchronous solver outperforms the synchronous solver for models with a low computation-to-communication ratio. The results are documented in this paper; for more results, readers can refer to our supplemental website.
Citations: 5
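The EASGD half of EA-wild can be sketched with scalar parameters: each worker takes a gradient step plus an elastic pull toward a shared center variable, and the center moves toward the workers' average. The lock-free Hogwild!-style application of updates is omitted here, and the hyperparameters are illustrative:

```python
def easgd_step(workers, center, grads, eta=0.1, alpha=0.05):
    """One synchronous EASGD round over scalar parameters: workers take
    a gradient step plus an elastic pull toward the center, while the
    center drifts toward the workers. EA-wild additionally applies these
    updates without locks, Hogwild!-style."""
    new_workers = [w - eta * g - alpha * (w - center)
                   for w, g in zip(workers, grads)]
    new_center = center + alpha * sum(w - center for w in workers)
    return new_workers, new_center

# Minimize f(w) = w**2 from three different starting points:
workers, center = [1.0, -1.0, 0.5], 0.0
for _ in range(100):
    grads = [2 * w for w in workers]       # gradient of w**2
    workers, center = easgd_step(workers, center, grads)
# Workers and center all converge toward the minimizer w = 0.
```

The elastic term is what lets workers run on stale state without diverging, which is why the combination with wild updates remains stable for the models the paper targets.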
Tiling-Based Programming Model for Structured Grids on GPU Clusters
Burak Bastem, D. Unat
DOI: 10.1145/3368474.3368485 (published 2020-01-15)
Abstract: Currently, more than 25% of supercomputers employ GPUs due to their massively parallel and power-efficient architectures. However, programming GPUs efficiently in a large-scale system is a demanding task, not only for computational scientists but also for programming experts, as multi-GPU programming requires managing distinct address spaces, generating GPU-specific code, and handling inter-device communication. To ease the programming effort, we propose a tiling-based, high-level GPU programming model for structured grid problems. The model abstracts data decomposition, memory management, and generation of GPU-specific code, and hides all types of data-transfer overhead. We demonstrate the effectiveness of the programming model on a heat simulation and a real-life cardiac model on a single GPU, on a single node with multiple GPUs, and on multiple nodes with multiple GPUs. We also present performance comparisons under different hardware and software configurations. The results show that the programming model successfully overlaps communication and provides good speedup on 192 GPUs.
Citations: 2
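The bookkeeping such a tiling model automates per GPU, tile ownership plus halo-extended read regions, can be sketched in one dimension (an illustrative sketch, not the proposed model's API):

```python
def tiles_with_halo(n, tile, halo):
    """Decompose a 1D structured grid of n points into tiles and compute
    each tile's halo-extended read region: the ((own_start, own_end),
    (read_start, read_end)) pairs a tiling runtime would use to schedule
    inter-device halo exchanges. Intervals are half-open."""
    out = []
    for start in range(0, n, tile):
        end = min(start + tile, n)
        out.append(((start, end), (max(0, start - halo), min(n, end + halo))))
    return out

# 10 grid points, tiles of 4, stencil radius 1:
print(tiles_with_halo(10, 4, 1))
# → [((0, 4), (0, 5)), ((4, 8), (3, 9)), ((8, 10), (7, 10))]
```

Because each tile's read region is known up front, the runtime can start halo transfers for one tile while computing another, which is the overlap the paper measures.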
Performance Improvement of a Scalable High-Order Compressible Flow Solver on Unstructured Hexahedral Grids
Kazuma Tago, T. Haga, S. Tsutsumi, R. Takaki
DOI: 10.1145/3368474.3368480 (published 2020-01-15)
Abstract: This paper describes LS-FLOW-HO, a high-order compressible flow solver based on the Flux Reconstruction (FR) method, and its performance optimization. The FR method achieves arbitrary high-order accuracy on unstructured grids and is well suited to manycore architectures because of the local data sets (stencils) involved in spatial discretization. This study focuses on performance optimization for the PRIMEHPC FX100, a Fujitsu scalar supercomputer. First, the execution time of sample code using the BLAS library is compared with that of code using a sparse matrix multiplication that computes only non-zero values. The sparse matrix multiplication takes less time than DGEMM for hexahedral elements when the degree of the interpolation polynomial is higher than 2. Using sparse matrix multiplication, hot-spot tuning was performed by extracting each subroutine from LS-FLOW-HO. Speedups were confirmed by changing the array structure at cell boundaries, improving memory/cache access latency through sequential memory access, and increasing loop length by loop collapsing. Applying these tunings to LS-FLOW-HO reduced execution time by up to 40% and reached 10.23% of the theoretical peak FLOPS using 16 OpenMP threads on a single node. On Intel Haswell, execution time was likewise reduced by about 49%, confirming that the proposed techniques are effective on other processors. Finally, sustained strong-scaling performance for a real application, supersonic jets, is demonstrated using 32 to 3200 nodes (1024 to 102400 cores).
Citations: 0
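The DGEMM-versus-sparse trade-off rests on storing and multiplying only the non-zeros, for example in CSR form. A minimal sketch of the principle (not the solver's actual kernel, whose operator sparsity pattern is specific to FR on hexahedra):

```python
def to_csr(dense):
    """Compress a dense matrix to CSR (values, column indices, row
    pointers), keeping only non-zeros: the storage the solver multiplies
    instead of calling DGEMM when the operator is sparse enough."""
    values, cols, rowptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                cols.append(j)
        rowptr.append(len(values))
    return values, cols, rowptr

def csr_matvec(values, cols, rowptr, x):
    """y = A @ x touching only the stored non-zeros."""
    return [sum(values[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(len(rowptr) - 1)]

A = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 4.0]]
vals, cols, ptr = to_csr(A)
print(csr_matvec(vals, cols, ptr, [1.0, 1.0, 1.0]))   # → [2.0, 3.0, 5.0]
```

DGEMM wins when the matrix is dense enough that its regular access pattern and vectorization outweigh the wasted multiplies by zero; the paper finds the crossover above polynomial degree 2 for hexahedral elements.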
Effect of an Incentive Implementation for Specifying Accurate Walltime in Job Scheduling
Shin'ichiro Takizawa, Ryousei Takano
DOI: 10.1145/3368474.3368490 (published 2020-01-15)
Abstract: Backfill is a widely adopted scheduling technique in shared large-scale systems. Accurate walltime estimates benefit both users and operators of such systems because backfill uses the estimated walltime for scheduling decisions. However, accuracy analyses have shown that user estimates are very inaccurate, which causes low utilization and long wait times. To overcome this situation, we propose building incentives for users to request accurate walltimes into the scheduling policy. We introduce a measure named WRSA (Walltime Request Specification Accuracy), which represents the accuracy of each user's requested walltimes, and propose WRSA-aware backfill, in which jobs submitted by users with high WRSA are prioritized during scheduling. Through simulation with synthetic and real workloads, we confirm that utilization improves by up to 30% and that the incentive for specifying accurate walltime also improves over existing methods.
Citations: 6
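A plausible reading of the WRSA measure and of the WRSA-aware ordering can be sketched as follows. The mean runtime-to-request ratio and the sort key below are assumptions for illustration; the paper defines the exact formula and tie-breaking:

```python
def wrsa(jobs):
    """WRSA of one user, read here as the mean ratio of actual runtime
    to requested walltime over their finished jobs (1.0 = perfectly
    accurate requests; this formula is an assumption, not the paper's)."""
    return sum(actual / requested for actual, requested in jobs) / len(jobs)

def backfill_order(queue, user_wrsa):
    """WRSA-aware ordering sketch: among queued jobs, prioritize jobs
    from users whose past walltime requests were accurate."""
    return sorted(queue, key=lambda job: -user_wrsa[job["user"]])

# (actual_runtime, requested_walltime) histories for two users:
user_wrsa = {"alice": wrsa([(50, 60), (90, 100)]),   # accurate requester
             "bob":   wrsa([(5, 100), (10, 200)])}   # heavy over-requester
queue = [{"id": 1, "user": "bob"}, {"id": 2, "user": "alice"}]
print([j["id"] for j in backfill_order(queue, user_wrsa)])   # → [2, 1]
```

Because accurate requesters wait less under this ordering, users gain a direct reason to tighten their walltime requests, which is the incentive the paper evaluates.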
The Effectiveness of Low-Precision Floating Arithmetic on Numerical Codes: A Case Study on Power Consumption
Ryuichi Sakamoto, Masaaki Kondo, K. Fujita, T. Ichimura, K. Nakajima
DOI: 10.1145/3368474.3368492 (published 2020-01-15)
Abstract: Low-precision floating-point arithmetic, which reduces numerical accuracy by computing with narrow bit widths, is attractive because it can improve the performance of numerical programs. A smaller memory footprint, faster computation, and energy savings can be expected from computing with low-precision data. However, few studies have examined how low-precision arithmetic affects the power and energy consumption of numerical codes. In this study, we investigate the power-efficiency improvement obtained by aggressively using low-precision arithmetic in HPC applications. In our evaluations, we analyze the power characteristics of a Poisson-equation solver and a ground-motion simulation program with double-precision and single-precision floating-point arithmetic. We confirm that energy efficiency improves with low-precision arithmetic, but that the improvement is heavily influenced by parameters such as data partitioning and the number of OpenMP threads.
Citations: 6
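The footprint side of the precision trade-off is easy to demonstrate with the standard library: halving precision halves per-element storage, and with it memory traffic, at the cost of accuracy (a generic sketch, unrelated to the paper's measurement setup):

```python
from array import array

# 'd' is a C double (8 bytes/element), 'f' a C float (4 bytes/element).
n = 1_000_000
double = array("d", [0.0]) * n
single = array("f", [0.0]) * n
print(double.itemsize * len(double))   # → 8000000 bytes
print(single.itemsize * len(single))   # → 4000000 bytes

# The accuracy cost: a C float cannot hold 0.1 to double precision,
# so the value read back differs from the Python (double) literal.
print(array("f", [0.1])[0] == 0.1)     # → False
```

Halved storage means half the bytes moved per solver iteration, which is one mechanism behind the energy savings the paper measures; the rounding difference above is why such savings must be weighed against accuracy requirements.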