{"title":"Multi-accelerator extension in OpenMP based on PGAS model","authors":"M. Nakao, H. Murai, M. Sato","doi":"10.1145/3293320.3293324","DOIUrl":"https://doi.org/10.1145/3293320.3293324","url":null,"abstract":"Many systems used in HPC field have multiple accelerators on a single compute node. However, programming for multiple accelerators is more difficult than that for a single accelerator. Therefore, in this paper, we propose an OpenMP extension that allows easy programming for multiple accelerators. We extend existing OpenMP syntax to create Partitioned Global Address Space (PGAS) on separated memories of several accelerators. The feature enables users to perform programming to use multiple accelerators in ease. In performance evaluation, we implement the STREAM Triad and the HIMENO benchmarks using the proposed OpenMP extension. As a result of evaluating the performance on a compute node equipped with up to four GPUs, we confirm that the proposed OpenMP extension demonstrates sufficient performance.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129702114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Method for Order/Degree Problem Based on Graph Symmetry and Simulated Annealing with MPI/OpenMP Parallelization","authors":"M. Nakao, H. Murai, M. Sato","doi":"10.1145/3293320.3293325","DOIUrl":"https://doi.org/10.1145/3293320.3293325","url":null,"abstract":"The network topology in various systems, such as large-scale data centers, high-performance computing systems, and Network on Chip, is strongly related to network latency. Designing a network topology with low latency can be defined as an order/degree problem (ODP) in graph theory by modeling the network topology as an undirected graph. This study proposes a method for efficiently solving ODPs based on graph symmetry and simulated annealing (SA). This method makes the network topology symmetrical, thereby improving the solution search performance of SA and drastically reducing the calculation time. The proposed method is applied to several problems from an international competition for ODPs called Graph Golf to find network topologies with sufficiently low latency. The symmetry-based calculation achieves a speed up of 31.76 times for one of the problems. Furthermore, to reduce calculation time, the proposed method is extended to use hybrid parallelization with MPI and OpenMP. As a result, a maximum speed up of 209.80 times was achieved on 20 compute nodes consisting of 400 CPU cores. Even faster performance was achieved by combining the symmetry-based calculation and hybrid parallelization.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132469491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Numerical Simulation of Two-phase Flow in Naturally Fractured Reservoirs Using Dual Porosity Method on Parallel Computers: Numerical Simulation of Two-phase Flow in Naturally Fractured Reservoirs","authors":"L. Shen, Tao Cui, Hui Liu, Zhouyuan Zhu, H. Zhong, Zhangxin Chen, Bo Yang, Ruijian He, Huaqing Liu","doi":"10.1145/3293320.3293322","DOIUrl":"https://doi.org/10.1145/3293320.3293322","url":null,"abstract":"The two-phase oil-water flow in naturally fractured reservoirs and its numerical methods are introduced in this paper, where the fractured reservoirs are modeled by the dual porosity method. An efficient numerical scheme, including the finite difference (volume) method, CPR-FPF preconditioners for linear systems and effective decoupling methods, is presented. Parallel computing techniques employed in simulation of the two-phase flow are also presented. Using these numerical scheme and parallel techniques, a parallel reservoir simulator is developed, which is capable of simulating large-scale reservoir models. The numerical results show that this simulator is accurate and scalable compared to the commercial software and the numerical scheme is also effective.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125681776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable communication performance prediction using auto-generated pseudo MPI event trace","authors":"Miwako Tsuji, T. Boku, M. Sato","doi":"10.1145/3293320.3293323","DOIUrl":"https://doi.org/10.1145/3293320.3293323","url":null,"abstract":"For the co-design of HPC systems and applications, it is important to study how application performance is affected by the characteristics of the future systems, not just on a computation node but also for the parallel processing including inter-node communications. Trace-driven network simulators have been widely used because of its simplicity. However, they require the trace files corresponding to the simulated system size. Therefore, if a future system is larger than a current system, we can not adopt the trace files directly; that is, it is difficult to simulate a system larger than the current system. In order to address the scaling problem in the trace-driven network simulation, we have proposed a method called SCAlable Mpi Profiler (SCAMP). The SCAMP method runs an application on a current system, obtains MPI-event trace files, copies and edits the real trace files to create a large amount of pseudo MPI-event trace files for a future system, and finally drives a network simulator by inputting the pseudo MPI-event trace files. We also implemented a pseudo MPI-event trace file generator based on the analysis of LLVM's intermediate representations. We aim to easily obtain a first-order approximation of the communication performances for various network configurations and applications. In this paper, we describe the SCAMP system design and implementation as well as several performance evaluation results.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130334610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems","authors":"David Cardwell, Fengguang Song","doi":"10.1145/3293320.3293321","DOIUrl":"https://doi.org/10.1145/3293320.3293321","url":null,"abstract":"Performance modeling of parallel applications on distributed memory systems is a challenging task due to the effects of CPU speed, memory access time, and communication cost. In this paper, we propose a simple and intuitive graphical model, which extends the widely used Roofline performance model to include the communication cost in addition to the memory access time and the peak CPU performance. This new performance model inherits the simplicity of the original Roofline model and enables performance evaluation on a third dimension of communication performance. Such a model will greatly facilitate and expedite the analysis, development and optimization of parallel programs on high-end computer systems. We empirically validate the extended new Roofline model usingfl oating-point-computation-bound, memory-bound, and communication-bound applications. Three distinct high-end computing platforms have been tested: 1) high performance computing (HPC) systems, 2) high throughput computing systems, and 3) cloud computing systems. Our experimental results with four different parallel applications show that the new model can approximately evaluate the performance of different programs on various distributed-memory systems. Furthermore, the extended new model is able to provide insight into how the problem size can affect the upper bound performance of parallel applications, which is a special property revealed by the new dimension of communication cost analysis.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128244359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-efficient implementation and batching of tridiagonalization on manycore CPUs","authors":"Shuhei Kudo, Toshiyuki Imamura","doi":"10.1145/3293320.3293329","DOIUrl":"https://doi.org/10.1145/3293320.3293329","url":null,"abstract":"We herein propose an efficient implementation of tridiagonalization (TRD) for small matrices on manycore CPUs. Tridiagonalization is a matrix decomposition that is used as a preprocessor for eigenvalue computations. Further, TRD for such small matrices appears even in the HPC environment as a subproblem of large computations. To utilize the large cache memory of recent manycore CPUs, we reconstructed all parts of the implementation by introducing a systematic code generator to achieve performance portability and future extensibility. The flexibility of the system allows us to incorporate the \"BLAS+X\" approach, thereby improving the data reusability of the TRD algorithm and batching. The performance results indicate that our system outperforms the library implementations of TRD nearly twofold (or more for small matrices), on three different manycore CPUs: Fujitsu SPARC64, Intel Xeon, and Xeon Phi. As an extension, we also implemented the batching execution of TRD with a cache-aware scheduler on the top of our system. It not only doubles the peak performance at small matrices of n = O(100), but also improves it significantly up to n = O(1, 000), which is our target.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128911574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","authors":"","doi":"10.1145/3293320","DOIUrl":"https://doi.org/10.1145/3293320","url":null,"abstract":"","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121366807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512","authors":"Raehyun Kim, Jaeyoung Choi, Myungho Lee","doi":"10.1145/3293320.3293334","DOIUrl":"https://doi.org/10.1145/3293320.3293334","url":null,"abstract":"This paper presents the optimal implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi Processor code-named Knights Landing (KNL) and the Intel Xeon Scalable Processors based on an auto-tuning approach with the Intel AVX-512 intrinsic functions. Our auto-tuning approach precisely determines the parameters reflecting the target architectural features. Our approach significantly reduces the search space and derives optimal parameter sets including the size of submatrices, prefetch distances, loop unrolling depth, and parallelization scheme. Without a single line of assembly code, our GEMM kernels show the comparable performance results to the Intel MKL and outperform other open-source BLAS libraries.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121919049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed and Parallel Programming Paradigms on the K computer and a Cluster","authors":"Jérôme Gurhem, Miwako Tsuji, S. Petiton, M. Sato","doi":"10.1145/3293320.3293330","DOIUrl":"https://doi.org/10.1145/3293320.3293330","url":null,"abstract":"In this paper, we focus on a distributed and parallel programming paradigm for massively multicore supercomputers. We introduce YML, a development and execution environment for parallel and distributed applications based on a graph of task components scheduled at runtime and optimized for several middlewares. Then we show why YML may be well adapted to applications running on a lot of cores. The tasks are developed with the PGAS language XMP based on directives. We use YML/XMP to implement the block-wise Gaussian elimination to solve linear systems. We also implemented it with XMP and MPI without blocks. ScaLAPACK was also used to created an non-block implementation of the resolution of a dense linear system through LU factorization. Furthermore, we run it with different amount of blocks and number of processes per task. We find out that a good compromise between the number of blocks and the number of processes per task gives interesting results. YML/XMP obtains results faster than XMP on the K computer and close to XMP, MPI and ScaLAPACK on clusters of CPUs. We conclude that parallel and distributed multilevel programming paradigms like YML/XMP may be interesting solutions for extreme scale computing.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122368470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Portability of CPU-Accelerated Applications via Automated Source-to-Source Translation","authors":"P. Sathre, M. Gardner, Wu-chun Feng","doi":"10.1145/3293320.3293338","DOIUrl":"https://doi.org/10.1145/3293320.3293338","url":null,"abstract":"Over the past decade, accelerator-based supercomputers have grown from 0% to 42% performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be \"write once, run anywhere,\" regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due to the sheer volume of code implemented in non-portable languages. For example, the tremendous success of CUDA, as evidenced by the vast cornucopia of CUDA-accelerated applications, makes it infeasible to manually rewrite all these applications to achieve portability. Consequently, we achieve portability by using our automated CUDA-to-OpenCL source-to-source translator called CU2CL. To demonstrate the state of the practice, we use CU2CL to automatically translate three medium-to-large, CUDA-optimized codes to OpenCL, thus enabling the codes to run on other GPU-accelerated systems (as well as CPU- or FPGA-based systems). These automatically translated codes deliver performance portability, including as much as three-fold performance improvement, on a GPU device not supported by CUDA.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125815811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}