Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region: Latest Publications

Fault Tolerance for Ensemble-based Molecular-Continuum Flow Simulations
Vahid Jafari, Philipp Neumann
DOI: 10.1145/3578178.3578220 (published 2023-02-27)
Abstract: Molecular dynamics (MD) simulations are computationally expensive and therefore very time-consuming. This particularly holds for molecular-continuum simulations in fluid dynamics, which rely on the simulation of MD ensembles coupled to computational fluid dynamics (CFD) solvers. Massively parallel implementations of MD simulations and the respective ensembles are therefore of utmost importance. However, the more processors are used for the molecular-continuum simulation, the higher the probability that software- or hardware-induced failures or malfunctions of a single processor cause the entire simulation to crash. To avoid long re-calculation times, a fault tolerance mechanism is required, especially for simulations carried out at the exascale. In this paper, we introduce a fault tolerance method for molecular-continuum simulations implemented in the macro-micro-coupling tool (MaMiCo), an open-source coupling tool for such multiscale simulations that allows the re-use of one's favorite MD and CFD solvers. The method makes use of a dynamic ensemble handling approach previously used to estimate statistical errors due to thermal fluctuations in the MD ensemble. The dynamic ensemble is always homogeneously distributed, and thus balanced, across the computational resources to minimize the overall induced overhead. The method further relies on an MPI implementation with fault tolerance support. We report scalability results with and without modeled system failures on three TOP500 supercomputers (Fugaku/RIKEN with ARM technology, Hawk/HLRS with AMD EPYC technology, and HSUper/Helmut Schmidt University with Intel Icelake processors) to demonstrate the feasibility of our approach.
Citations: 0
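The dynamic ensemble handling described above can be pictured as redistributing ensemble members homogeneously over the surviving ranks after a failure. A minimal sketch of that balancing step (the function and data layout are hypothetical, not MaMiCo's actual API):

```python
def redistribute(members, ranks):
    """Assign ensemble members round-robin over the surviving ranks,
    keeping the load as homogeneous as possible (hypothetical sketch,
    not the actual MaMiCo interface)."""
    assignment = {r: [] for r in ranks}
    for i, m in enumerate(members):
        assignment[ranks[i % len(ranks)]].append(m)
    return assignment

# Simulate losing rank 2 out of ranks 0..3 with 8 ensemble members:
survivors = [0, 1, 3]
plan = redistribute(list(range(8)), survivors)
# Each surviving rank now holds 2 or 3 members (maximum imbalance of 1).
```

The round-robin choice keeps the per-rank member count within one of the average, which is what "homogeneously distributed" requires of any such scheme.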
LibCOS: Enabling Converged HPC and Cloud Data Stores with MPI
Daniel Araújo De Medeiros, S. Markidis, Ivy Bo Peng
DOI: 10.1145/3578178.3578236 (published 2023-02-27)
Abstract: Federated HPC and cloud resources are becoming increasingly strategic for providing diversified and geographically available computing resources. However, accessing data stores across HPC and cloud storage systems is challenging. Many cloud providers use object storage systems to support their clients in storing and retrieving data over the internet. One popular method is REST APIs atop the HTTP protocol, with Amazon's S3 API being supported by most vendors. In contrast, HPC systems are contained within their own networks and tend to use parallel file systems with POSIX-like interfaces. This work addresses the challenge of diverse data stores on HPC and cloud systems by providing native object storage support through the unified MPI I/O interface in HPC applications. In particular, we provide a prototype library called LibCOS that transparently enables MPI applications running on HPC systems to access object storage on remote cloud systems. We evaluated LibCOS on a Ceph object storage system and a traditional HPC system. In addition, we conducted a performance characterization of the core S3 operations that enable individual and collective MPI I/O. Our evaluation with HACC, IOR, and BigSort shows that bridging diverse data stores on HPC and cloud storage is feasible and can be achieved transparently through the widely adopted MPI I/O interface. We also show that a native object storage system like Ceph can improve the scalability of I/O operations in parallel applications.
Citations: 1
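One way to picture MPI I/O over object storage is mapping a contiguous file read onto ranged GETs across fixed-size objects. A minimal sketch, assuming a hypothetical "file.partN" key scheme rather than LibCOS's actual data layout:

```python
def byte_range_to_objects(offset, length, object_size):
    """Map a contiguous MPI-style file read (offset, length) onto ranged
    GETs over fixed-size objects named 'file.part<N>' (hypothetical key
    scheme, not LibCOS's actual layout). Returns (key, first, last)
    byte ranges, matching the inclusive semantics of an HTTP Range
    header."""
    requests = []
    end = offset + length
    while offset < end:
        obj = offset // object_size
        start = offset % object_size
        take = min(object_size - start, end - offset)
        requests.append((f"file.part{obj}", start, start + take - 1))
        offset += take
    return requests

# A 10-byte read at offset 12 over 8-byte objects spans two objects:
print(byte_range_to_objects(12, 10, 8))
# → [('file.part1', 4, 7), ('file.part2', 0, 5)]
```

Collective MPI I/O would then aggregate the per-rank request lists before issuing them, which is one place an object-storage backend can batch work.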
Parallelization of All-Pairs-Shortest-Path Algorithms in Unweighted Graph
M. Nakao, H. Murai, M. Sato
DOI: 10.1145/3368474.3368478 (published 2020-01-15)
Abstract: The design of the network topology of a large-scale parallel computer system can be represented as an order/degree problem in graph theory. Solving the order/degree problem requires computing the all-pairs shortest paths (APSP) of the graph. This paper therefore evaluates and compares two parallel algorithms that quickly find the APSP in unweighted graphs. The first algorithm is based on breadth-first search (BFS-APSP) and the second on the adjacency matrix (ADJ-APSP). First, we develop serial algorithms and threaded algorithms using OpenMP, and show that ADJ-APSP is up to 32.34 times faster than BFS-APSP. Next, we develop hybrid-parallel algorithms using OpenMP and MPI, and show that BFS-APSP is faster than ADJ-APSP under certain conditions because BFS-APSP admits a greater maximum number of processes. In addition, we parallelize ADJ-APSP on a single GPU (NVIDIA Tesla V100) and achieve a speedup of up to 16.53-fold over a single CPU. Finally, we evaluate the performance of the algorithms on 128 GPUs and achieve a computation time 101.10 times faster than on a single GPU. Moreover, we show that the calculation time of both algorithms can be greatly reduced when the input graphs are symmetric.
Citations: 4
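The ADJ-APSP idea, distances in an unweighted graph recovered from repeated boolean adjacency-matrix products, can be sketched as follows (a naive O(n^4) illustration of the principle, not the paper's optimized bitwise kernel):

```python
def adj_apsp(adj):
    """All-pairs shortest paths in an unweighted graph by repeatedly
    'multiplying' the boolean adjacency matrix: after k products, a set
    bit at (i, j) means a path of length <= k exists. This is the idea
    behind ADJ-APSP; real implementations pack rows into machine words."""
    n = len(adj)
    INF = float("inf")
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    reach = [row[:] for row in adj]          # pairs reachable in <= step edges
    for step in range(1, n):
        for i in range(n):
            for j in range(n):
                if reach[i][j] and dist[i][j] == INF:
                    dist[i][j] = step        # first time reached: distance found
        # boolean matrix product: extend every known path by one edge
        reach = [[int(any(reach[i][k] and adj[k][j] for k in range(n)))
                  for j in range(n)] for i in range(n)]
    return dist

# Path graph 0-1-2: distance from 0 to 2 is 2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(adj_apsp(A)[0])   # → [0, 1, 2]
```

BFS-APSP instead runs one BFS per source vertex, which parallelizes over more processes, matching the paper's observation about the two algorithms' scaling limits.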
Extended Hoeffding Adaptive Tree based-Server Load Prediction in Cloud Computing environment
Hajer Toumi, Zaki Brahmi, M. Gammoudi
DOI: 10.1145/3368474.3368475 (published 2020-01-15)
Abstract: Cloud Computing (CC) enables a client-server relationship that releases users from computational and storage responsibilities. As a multi-tenant environment, cloud providers deal, on the one hand, with multiple concurrent users, each exhibiting a different and variable behavior over time, and on the other hand, with performance interference due to the co-location of multiple virtual machines (VMs) on the same server. Therefore, real-time server load prediction is needed to ensure efficient resource provisioning. While classical data-mining techniques suffer from long evaluation times and are unable to react to changes as they arrive, stream-mining techniques can provide real-time prediction and change detection. In this paper, we therefore use a well-known stream-mining technique, the Hoeffding Adaptive Tree (HAT), to provide real-time server load prediction. The aim of our proposed technique is to detect, and react on the fly to, the different kinds of changes that can affect the server load. We therefore augment HAT with ensemble drift detectors to produce more accurate predictions. To evaluate our proposed technique, HAT-ADS, we first compare it with a well-known load prediction technique based on a Bayesian approach, and then with other HAT-based techniques. Overall, the experiments show that HAT-ADS is highly flexible with respect to various types of change, providing high accuracy with quick evaluation time and a small memory footprint.
Citations: 1
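The kind of drift detector that can be ensembled with HAT can be illustrated with a toy two-window mean comparison (illustrative only; the thresholded window pair below is an assumption, not the paper's HAT-ADS algorithm):

```python
from collections import deque

class WindowDriftDetector:
    """Toy drift detector: compare the mean of a short recent window
    against a longer reference window and flag drift when they diverge
    by more than a threshold. Real detectors (e.g. ADWIN) choose the
    windows and threshold with statistical guarantees."""
    def __init__(self, short=10, long=50, threshold=0.5):
        self.short = deque(maxlen=short)
        self.long = deque(maxlen=long)
        self.threshold = threshold

    def update(self, x):
        self.short.append(x)
        self.long.append(x)
        if len(self.long) < self.long.maxlen:
            return False                      # not enough history yet
        mean_s = sum(self.short) / len(self.short)
        mean_l = sum(self.long) / len(self.long)
        return abs(mean_s - mean_l) > self.threshold

# A load signal that jumps from 0.0 to 2.0 at step 60:
det = WindowDriftDetector()
drift = [det.update(0.0 if i < 60 else 2.0) for i in range(80)]
# Drift is flagged a few steps after the shift, once the short window fills.
```

An ensemble of such detectors with different window lengths trades detection delay against false alarms, which is the role the ensemble plays in the paper.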
Enhancing a manycore-oriented compressed cache for GPGPU
Keitarou Oka, Satoshi Kawakami, Teruo Tanimoto, Takatsugu Ono, Koji Inoue
DOI: 10.1145/3368474.3368491 (published 2020-01-15)
Abstract: GPUs achieve high performance by exploiting massive thread parallelism. However, some factors limit performance on GPUs, one of which is the negative effect of L1 cache misses. In some applications, GPUs are likely to suffer from L1 cache conflicts because a large number of cores share a small L1 cache capacity. A cache architecture based on data compression is a strong candidate for solving this problem, as it can reduce the number of cache misses. Unlike previous studies, our data compression scheme exploits value locality not only within cache lines but also across cache lines. We enhance the structure of a last-level compression cache proposed for general-purpose manycore processors, optimizing it for the shared L1 caches on GPUs. The experimental results reveal that our proposal outperforms the other compression cache for GPUs by 11 points on average.
Citations: 0
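Inter-line value locality can be illustrated with a base-delta scheme that stores one base line and small per-word deltas for neighboring lines (an illustrative sketch of the compression principle, not the paper's cache organization):

```python
def compress_lines(lines):
    """Base-delta sketch exploiting inter-line value locality: store the
    first cache line as a base and every later line as per-word deltas.
    When neighboring lines hold similar values, the deltas are small and
    fit in fewer bits than the original words."""
    base = lines[0]
    deltas = [[w - b for w, b in zip(line, base)] for line in lines[1:]]
    return base, deltas

def decompress_lines(base, deltas):
    """Exact inverse of compress_lines."""
    return [base] + [[b + d for b, d in zip(base, delta)] for delta in deltas]

# Three neighboring lines with similar word values compress to tiny deltas:
lines = [[100, 200, 300], [101, 201, 301], [99, 199, 299]]
base, deltas = compress_lines(lines)
print(deltas)   # → [[1, 1, 1], [-1, -1, -1]]
```

Intra-line schemes apply the same idea within one line; the scheme above shows why extending the base across lines can capture locality that per-line compression misses.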
Rethinking the Value of Asynchronous Solvers for Distributed Deep Learning
Arissa Wongpanich, Yang You, J. Demmel
DOI: 10.1145/3368474.3368498 (published 2020-01-15)
Abstract: In recent years, the field of machine learning has seen significant advances as data becomes more abundant and deep learning models become larger and more complex. However, these improvements in accuracy [2] have come at the cost of longer training time. As a result, state-of-the-art models like OpenAI's GPT-2 [18] or AlphaZero [20] require distributed systems or clusters to speed up training. Both asynchronous and synchronous solvers currently exist for distributed training. In this paper, we implement state-of-the-art asynchronous and synchronous solvers and compare them to help readers pick the most appropriate solver for their own applications. We address three main challenges: (1) implementing asynchronous solvers that can outperform six common algorithm variants, (2) achieving state-of-the-art distributed performance for various applications with different computational patterns, and (3) maintaining accuracy for large-batch asynchronous training. For asynchronous algorithms, we implement an algorithm called EA-wild, which combines the idea of non-locking wild updates from Hogwild! [19] with EASGD. Our implementation scales to 217,600 cores and finishes 90 epochs of training the ResNet-50 model on ImageNet in 15 minutes (the baseline takes 29 hours on eight NVIDIA P100 GPUs). We conclude that more complex models (e.g., ResNet-50) favor synchronous methods, while our asynchronous solver outperforms the synchronous solver for models with a low computation-to-communication ratio. The results are documented in this paper; for more results, readers can refer to our supplemental website.
Citations: 5
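The EASGD half of EA-wild can be sketched with scalar parameters: each worker takes a gradient step plus an elastic pull toward a shared center variable, and the center moves toward the workers' average. The lock-free Hogwild!-style application of updates is omitted here, and the hyperparameters are illustrative:

```python
def easgd_step(workers, center, grads, eta=0.1, alpha=0.05):
    """One synchronous EASGD round over scalar parameters: workers take
    a gradient step plus an elastic pull toward the center, while the
    center drifts toward the workers. EA-wild additionally applies these
    updates without locks, Hogwild!-style."""
    new_workers = [w - eta * g - alpha * (w - center)
                   for w, g in zip(workers, grads)]
    new_center = center + alpha * sum(w - center for w in workers)
    return new_workers, new_center

# Minimize f(w) = w**2 from three different starting points:
workers, center = [1.0, -1.0, 0.5], 0.0
for _ in range(100):
    grads = [2 * w for w in workers]       # gradient of w**2
    workers, center = easgd_step(workers, center, grads)
# Workers and center all converge toward the minimizer w = 0.
```

The elastic term is what lets workers run on stale state without diverging, which is why the combination with wild updates remains stable for the models the paper targets.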
Tiling-Based Programming Model for Structured Grids on GPU Clusters
Burak Bastem, D. Unat
DOI: 10.1145/3368474.3368485 (published 2020-01-15)
Abstract: Currently, more than 25% of supercomputers employ GPUs due to their massively parallel and power-efficient architectures. However, programming GPUs efficiently in a large-scale system is a demanding task, not only for computational scientists but also for programming experts, as multi-GPU programming requires managing distinct address spaces, generating GPU-specific code, and handling inter-device communication. To ease the programming effort, we propose a tiling-based, high-level GPU programming model for structured grid problems. The model abstracts data decomposition, memory management, and generation of GPU-specific code, and hides all types of data-transfer overhead. We demonstrate the effectiveness of the programming model on a heat simulation and a real-life cardiac model on a single GPU, on a single node with multiple GPUs, and on multiple nodes with multiple GPUs. We also present performance comparisons under different hardware and software configurations. The results show that the programming model successfully overlaps communication and provides good speedup on 192 GPUs.
Citations: 2
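The bookkeeping such a tiling model automates per GPU, tile ownership plus halo-extended read regions, can be sketched in one dimension (an illustrative sketch, not the proposed model's API):

```python
def tiles_with_halo(n, tile, halo):
    """Decompose a 1D structured grid of n points into tiles and compute
    each tile's halo-extended read region: the ((own_start, own_end),
    (read_start, read_end)) pairs a tiling runtime would use to schedule
    inter-device halo exchanges. Intervals are half-open."""
    out = []
    for start in range(0, n, tile):
        end = min(start + tile, n)
        out.append(((start, end), (max(0, start - halo), min(n, end + halo))))
    return out

# 10 grid points, tiles of 4, stencil radius 1:
print(tiles_with_halo(10, 4, 1))
# → [((0, 4), (0, 5)), ((4, 8), (3, 9)), ((8, 10), (7, 10))]
```

Because each tile's read region is known up front, the runtime can start halo transfers for one tile while computing another, which is the overlap the paper measures.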
Performance Improvement of a Scalable High-Order Compressible Flow Solver on Unstructured Hexahedral Grids
Kazuma Tago, T. Haga, S. Tsutsumi, R. Takaki
DOI: 10.1145/3368474.3368480 (published 2020-01-15)
Abstract: This paper describes LS-FLOW-HO, a high-order compressible flow solver based on the Flux Reconstruction (FR) method, and its performance optimization. The FR method achieves arbitrary high-order accuracy on unstructured grids and is well suited to manycore architectures because of the local data sets (stencils) involved in spatial discretization. This study focuses on performance optimization for the PRIMEHPC FX100, a Fujitsu scalar supercomputer. First, the execution time of sample code using the BLAS library is compared with that of code using a sparse matrix multiplication that computes only non-zero values. The sparse matrix multiplication takes less time than DGEMM for hexahedral elements when the degree of the interpolation polynomial is higher than 2. Using sparse matrix multiplication, hot-spot tuning was performed by extracting each subroutine from LS-FLOW-HO. Speedups were confirmed by changing the array structure at cell boundaries, improving memory/cache access latency through sequential memory access, and increasing loop length by loop collapsing. Applying these tunings to LS-FLOW-HO reduced execution time by up to 40% and reached 10.23% of the theoretical peak FLOPS using 16 OpenMP threads on a single node. On Intel Haswell, execution time was likewise reduced by about 49%, confirming that the proposed techniques are effective on other processors. Finally, sustained strong-scaling performance for a real application, supersonic jets, is demonstrated using 32 to 3200 nodes (1024 to 102400 cores).
Citations: 0
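The DGEMM-versus-sparse trade-off rests on storing and multiplying only the non-zeros, for example in CSR form. A minimal sketch of the principle (not the solver's actual kernel, whose operator sparsity pattern is specific to FR on hexahedra):

```python
def to_csr(dense):
    """Compress a dense matrix to CSR (values, column indices, row
    pointers), keeping only non-zeros: the storage the solver multiplies
    instead of calling DGEMM when the operator is sparse enough."""
    values, cols, rowptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                cols.append(j)
        rowptr.append(len(values))
    return values, cols, rowptr

def csr_matvec(values, cols, rowptr, x):
    """y = A @ x touching only the stored non-zeros."""
    return [sum(values[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(len(rowptr) - 1)]

A = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 4.0]]
vals, cols, ptr = to_csr(A)
print(csr_matvec(vals, cols, ptr, [1.0, 1.0, 1.0]))   # → [2.0, 3.0, 5.0]
```

DGEMM wins when the matrix is dense enough that its regular access pattern and vectorization outweigh the wasted multiplies by zero; the paper finds the crossover above polynomial degree 2 for hexahedral elements.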
Effect of an Incentive Implementation for Specifying Accurate Walltime in Job Scheduling
Shin'ichiro Takizawa, Ryousei Takano
DOI: 10.1145/3368474.3368490 (published 2020-01-15)
Abstract: Backfill is a widely adopted scheduling technique in shared large-scale systems. Accurate walltime estimates benefit both users and operators of such systems because backfill uses the estimated walltime for scheduling decisions. However, accuracy analyses have shown that user estimates are very inaccurate, which causes low utilization and long wait times. To overcome this situation, we propose building incentives for users to request accurate walltimes into the scheduling policy. We introduce a measure named WRSA (Walltime Request Specification Accuracy), which represents the accuracy of each user's requested walltimes, and propose WRSA-aware backfill, in which jobs submitted by users with high WRSA are prioritized during scheduling. Through simulation with synthetic and real workloads, we confirm that utilization improves by up to 30% and that the incentive for specifying accurate walltime also improves over existing methods.
Citations: 6
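A plausible reading of the WRSA measure and of the WRSA-aware ordering can be sketched as follows. The mean runtime-to-request ratio and the sort key below are assumptions for illustration; the paper defines the exact formula and tie-breaking:

```python
def wrsa(jobs):
    """WRSA of one user, read here as the mean ratio of actual runtime
    to requested walltime over their finished jobs (1.0 = perfectly
    accurate requests; this formula is an assumption, not the paper's)."""
    return sum(actual / requested for actual, requested in jobs) / len(jobs)

def backfill_order(queue, user_wrsa):
    """WRSA-aware ordering sketch: among queued jobs, prioritize jobs
    from users whose past walltime requests were accurate."""
    return sorted(queue, key=lambda job: -user_wrsa[job["user"]])

# (actual_runtime, requested_walltime) histories for two users:
user_wrsa = {"alice": wrsa([(50, 60), (90, 100)]),   # accurate requester
             "bob":   wrsa([(5, 100), (10, 200)])}   # heavy over-requester
queue = [{"id": 1, "user": "bob"}, {"id": 2, "user": "alice"}]
print([j["id"] for j in backfill_order(queue, user_wrsa)])   # → [2, 1]
```

Because accurate requesters wait less under this ordering, users gain a direct reason to tighten their walltime requests, which is the incentive the paper evaluates.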
The Effectiveness of Low-Precision Floating Arithmetic on Numerical Codes: A Case Study on Power Consumption
Ryuichi Sakamoto, Masaaki Kondo, K. Fujita, T. Ichimura, K. Nakajima
DOI: 10.1145/3368474.3368492 (published 2020-01-15)
Abstract: Low-precision floating-point arithmetic, which reduces numerical accuracy by computing with narrow bit widths, is attractive because it can improve the performance of numerical programs. A smaller memory footprint, faster computation, and energy savings can be expected from computing with low-precision data. However, few studies have examined how low-precision arithmetic affects the power and energy consumption of numerical codes. In this study, we investigate the power-efficiency improvement obtained by aggressively using low-precision arithmetic in HPC applications. In our evaluations, we analyze the power characteristics of a Poisson-equation solver and a ground-motion simulation program with double-precision and single-precision floating-point arithmetic. We confirm that energy efficiency improves with low-precision arithmetic, but that the improvement is heavily influenced by parameters such as data partitioning and the number of OpenMP threads.
Citations: 6
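The footprint side of the precision trade-off is easy to demonstrate with the standard library: halving precision halves per-element storage, and with it memory traffic, at the cost of accuracy (a generic sketch, unrelated to the paper's measurement setup):

```python
from array import array

# 'd' is a C double (8 bytes/element), 'f' a C float (4 bytes/element).
n = 1_000_000
double = array("d", [0.0]) * n
single = array("f", [0.0]) * n
print(double.itemsize * len(double))   # → 8000000 bytes
print(single.itemsize * len(single))   # → 4000000 bytes

# The accuracy cost: a C float cannot hold 0.1 to double precision,
# so the value read back differs from the Python (double) literal.
print(array("f", [0.1])[0] == 0.1)     # → False
```

Halved storage means half the bytes moved per solver iteration, which is one mechanism behind the energy savings the paper measures; the rounding difference above is why such savings must be weighed against accuracy requirements.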