Fabio Banchelli, Joan Vinyals-Ylla-Catala, Josep Pocurull, Marc Clascà, Kilian Peiro, Filippo Spiga, M. Garcia-Gasulla, Filippo Mantovani
{"title":"NVIDIA Grace Superchip Early Evaluation for HPC Applications","authors":"Fabio Banchelli, Joan Vinyals-Ylla-Catala, Josep Pocurull, Marc Clascà, Kilian Peiro, Filippo Spiga, M. Garcia-Gasulla, Filippo Mantovani","doi":"10.1145/3636480.3637284","DOIUrl":"https://doi.org/10.1145/3636480.3637284","url":null,"abstract":"Arm-based systems have been a reality in HPC for more than a decade. However, a new chip entering the market always implies challenges, not only at the ISA level, but also with regard to the SoC integration, the memory subsystem, the board integration, the node interconnection, and finally the OS and all layers of the system software (compiler and libraries). Guided by the procurement of an NVIDIA Grace HPC cluster within the deployment of MareNostrum 5, and emulating the approach of a scientist who needs to migrate their scientific research to a new HPC system, we evaluated five complex scientific applications on engineering sample nodes of NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip (CPU-only). We report intra-node and inter-node scalability and early performance results showing a speed-up between 1.3× and 4.28× for all codes when compared to the current generation of MareNostrum 4 powered by Intel Skylake CPUs.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wentao Liang, N. Fujita, Ryohei Kobayashi, T. Boku
{"title":"Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling","authors":"Wentao Liang, N. Fujita, Ryohei Kobayashi, T. Boku","doi":"10.1145/3636480.3637220","DOIUrl":"https://doi.org/10.1145/3636480.3637220","url":null,"abstract":"Intel oneAPI is a programming framework that accepts various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply their code written in a single language, DPC++, to this heterogeneous programming environment. However, in practice, it is not easy to apply to different accelerators, especially for non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment set to utilize the single DPC++ programming to handle true multi-hetero acceleration including NVIDIA GPU and Intel FPGA simultaneously. In this paper, we will show how this is done and what kind of applications can be targeted.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"3 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPI-Adapter2: An Automatic ABI Translation Library Builder for MPI Application Binary Portability","authors":"Shinji Sumimoto, Toshihiro Hanawa, Kengo Nakajima","doi":"10.1145/3636480.3637219","DOIUrl":"https://doi.org/10.1145/3636480.3637219","url":null,"abstract":"This paper proposes an automatic MPI ABI (Application Binary Interface) translation library builder named MPI-Adapter2. The container-based job environment is becoming widespread in computer centers. However, when a user uses the container image in another computer center, the container with an MPI binary may not work because of the difference in the ABI of MPI libraries. MPI-Adapter2 enables building MPI ABI translation libraries automatically from MPI libraries. MPI-Adapter2 can build MPI ABI translation libraries not only between different MPI implementations, such as Open MPI, MPICH, and Intel MPI, but also between different versions of an MPI implementation. We have implemented and evaluated MPI-Adapter2 among several versions of Intel MPI, MPICH, MVAPICH, and Open MPI using the NAS parallel benchmarks and pHEAT-3D, and found that MPI-Adapter2 worked fine except for an Open MPI ver. 4 binary on Open MPI ver. 2 on IS of the NAS parallel benchmarks, because of the difference in MPI object size. We also evaluated the pHEAT-3D binary compiled by Open MPI ver. 5 using MPI-Adapter2 up to 1024 processes with 128 nodes. The performance overhead of MPI-Adapter2 relative to the Intel native evaluation was 1.3%.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"2 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Error-Energy Tradeoff in Molecular and Molecular-Continuum Fluid Simulations","authors":"Amartya Das Sharma, Ruben Horn, Philipp Neumann","doi":"10.1145/3636480.3636486","DOIUrl":"https://doi.org/10.1145/3636480.3636486","url":null,"abstract":"Energy consumption plays a crucial role when designing simulation studies. In this work, we take a step towards modelling the relationship between statistical error and energy consumption for molecular and molecular-continuum flow simulations. After revisiting statistical error analysis and run time complexities for molecular dynamics (MD) simulations, we verify the respective relationships in stand-alone short-range MD simulations. We then extend the analysis to coupled molecular-continuum simulations, including the multi-instance (i.e., MD ensemble averaging) case, and additionally analyse the impact of noise filters. Our findings suggest that Gauss filters can reduce the statistical error to a similar degree as doubling the number of MD instances would. We further use regression to derive an analytical energy consumption model that predicts energy consumption on our HPC-cluster HSUper, to achieve simulation results at a prescribed statistical error (or gain in signal-to-noise ratio, respectively). All simulations were carried out using the MD software ls1 mardyn and the molecular-continuum coupling tool MaMiCo. However, the derived models are easily transferable to other pieces of software and other HPC platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"10 18","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yutaka Watanabe, Jinpil Lee, K. Sano, T. Boku, M. Sato
{"title":"Design and Preliminary Evaluation of OpenACC Compiler for FPGA with OpenCL and Stream Processing DSL","authors":"Yutaka Watanabe, Jinpil Lee, K. Sano, T. Boku, M. Sato","doi":"10.1145/3373271.3373274","DOIUrl":"https://doi.org/10.1145/3373271.3373274","url":null,"abstract":"FPGA has emerged as one of the attractive computing devices in the post-Moore era because of its power efficiency and reconfigurability, even for future high-performance computing. We have designed an OpenACC compiler for FPGA that generates the kernel code by using a stream processing Domain Specific Language (DSL) called SPGen, together with OpenCL. Although programming for FPGA has recently been improved dramatically by High-Level Synthesis (HLS) frameworks such as OpenCL and HLS C, it is still too difficult for HPC application developers, and directive-based programming models such as OpenACC should be supported even for FPGA. OpenCL can be used as a portable intermediate code for OpenACC for FPGA. However, the generation of hardware from OpenCL is not easy to understand and therefore requires expert knowledge. SPGen is a DSL framework for generating stream processing HDL modules from the description of a dataflow graph. The advantage of our approach is that the code generation with SPGen enables more comprehensive low-level optimization in the OpenACC compiler. The preliminary evaluation results show that, for some kernels, the proposed method, which translates the OpenACC C code into OpenCL and SPGen codes, can perform low-level optimization more explicitly than the OpenCL-only method, which translates the OpenACC C code into OpenCL code only. We also observed that more resources might be consumed in the proposed method. However, implementations of both methods are preliminary. We believe improving code generation will fix problems such as high resource consumption.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124333703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Analysis of Inter-Process Interference on a Hybrid Memory System","authors":"S. Imamura, Eiji Yoshida","doi":"10.1145/3373271.3373272","DOIUrl":"https://doi.org/10.1145/3373271.3373272","url":null,"abstract":"Persistent memory (PM) is an emerging memory device that has a larger capacity and lower cost per gigabyte than conventional DRAM. Intel has released a first PM product called Optane™ DC Persistent Memory, but its performance is several times lower than that of DRAM. Therefore, it will be used in combination with DRAM to configure hybrid memory systems that can obtain both the high performance of DRAM and large capacity of PM. In this paper, we evaluate and analyze the performance interference between various types of processes that are concurrently executed on a real server platform having a hybrid memory system. Through the evaluation with a synthetic benchmark, we show that the interference on the hybrid memory system is significantly different from that on a conventional DRAM-only memory system.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126322825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Multigrid Method on Multicore/Manycore Clusters","authors":"K. Nakajima","doi":"10.1145/3373271.3373273","DOIUrl":"https://doi.org/10.1145/3373271.3373273","url":null,"abstract":"The parallel multigrid method is expected to be a useful algorithm in the exascale era because of its scalability. It is widely known that the overhead of the coarse grid solver in the parallel multigrid method is significant if the number of MPI processes is O(10^4) or larger. The author proposed the hCGA for avoiding such overhead. Recently, the AM-hCGA, a further optimized version of the hCGA, was proposed by the author, and its performance was evaluated on the Oakforest-PACS system (OFP) with IHK/McKernel at JCAHPC using up to 2,048 nodes of Intel Xeon Phi (Knights Landing). In the present work, the developed method is also implemented on the Oakbridge-CX system (OBCX) at the University of Tokyo using up to 1,024 nodes (2,048 sockets) of Intel Xeon Platinum 8280 (Cascade Lake). Performance in weak and strong scaling is evaluated for an application on 3D groundwater flow through heterogeneous porous media (pGW3D-FVM). The hCGA and the AM-hCGA provide excellent performance on both OFP and OBCX with larger numbers of nodes. In particular, it achieved excellent performance in strong scaling on OBCX.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"82 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125909921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, Ayumi Nakamichi, T. Boku
{"title":"OpenCL-enabled GPU-FPGA Accelerated Computing with Inter-FPGA Communication","authors":"Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, Ayumi Nakamichi, T. Boku","doi":"10.1145/3373271.3373275","DOIUrl":"https://doi.org/10.1145/3373271.3373275","url":null,"abstract":"Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research; their computational and communication capabilities have drastically improved in recent years owing to advances in semiconductor integration technologies. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL that reduce the amount of programming effort required have been developed and offered by FPGA vendors. These improvements reveal the possibility of implementing a concept that enables on-the-fly offloading of computational loads at which CPUs/GPUs perform poorly compared to FPGAs while moving data with low latency. We think that this concept is key to improving the performance of heterogeneous supercomputers that use accelerators such as the GPU. In this paper, we propose an approach for GPU-FPGA accelerated computing with the OpenCL programming framework that is based on the OpenCL-enabled GPU-FPGA DMA method and the FPGA-to-FPGA communication method. The experimental results demonstrate that our proposed method can enable GPUs and FPGAs to work together over different nodes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115160620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","authors":"","doi":"10.1145/3373271","DOIUrl":"https://doi.org/10.1145/3373271","url":null,"abstract":"","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"204 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133941326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}