{"title":"Multi-accelerator extension in OpenMP based on PGAS model","authors":"M. Nakao, H. Murai, M. Sato","doi":"10.1145/3293320.3293324","DOIUrl":"https://doi.org/10.1145/3293320.3293324","url":null,"abstract":"Many systems used in HPC field have multiple accelerators on a single compute node. However, programming for multiple accelerators is more difficult than that for a single accelerator. Therefore, in this paper, we propose an OpenMP extension that allows easy programming for multiple accelerators. We extend existing OpenMP syntax to create Partitioned Global Address Space (PGAS) on separated memories of several accelerators. The feature enables users to perform programming to use multiple accelerators in ease. In performance evaluation, we implement the STREAM Triad and the HIMENO benchmarks using the proposed OpenMP extension. As a result of evaluating the performance on a compute node equipped with up to four GPUs, we confirm that the proposed OpenMP extension demonstrates sufficient performance.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129702114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Method for Order/Degree Problem Based on Graph Symmetry and Simulated Annealing with MPI/OpenMP Parallelization","authors":"M. Nakao, H. Murai, M. Sato","doi":"10.1145/3293320.3293325","DOIUrl":"https://doi.org/10.1145/3293320.3293325","url":null,"abstract":"The network topology in various systems, such as large-scale data centers, high-performance computing systems, and Network on Chip, is strongly related to network latency. Designing a network topology with low latency can be defined as an order/degree problem (ODP) in graph theory by modeling the network topology as an undirected graph. This study proposes a method for efficiently solving ODPs based on graph symmetry and simulated annealing (SA). This method makes the network topology symmetrical, thereby improving the solution search performance of SA and drastically reducing the calculation time. The proposed method is applied to several problems from an international competition for ODPs called Graph Golf to find network topologies with sufficiently low latency. The symmetry-based calculation achieves a speed up of 31.76 times for one of the problems. Furthermore, to reduce calculation time, the proposed method is extended to use hybrid parallelization with MPI and OpenMP. As a result, a maximum speed up of 209.80 times was achieved on 20 compute nodes consisting of 400 CPU cores. Even faster performance was achieved by combining the symmetry-based calculation and hybrid parallelization.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132469491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Numerical Simulation of Two-phase Flow in Naturally Fractured Reservoirs Using Dual Porosity Method on Parallel Computers: Numerical Simulation of Two-phase Flow in Naturally Fractured Reservoirs","authors":"L. Shen, Tao Cui, Hui Liu, Zhouyuan Zhu, H. Zhong, Zhangxin Chen, Bo Yang, Ruijian He, Huaqing Liu","doi":"10.1145/3293320.3293322","DOIUrl":"https://doi.org/10.1145/3293320.3293322","url":null,"abstract":"The two-phase oil-water flow in naturally fractured reservoirs and its numerical methods are introduced in this paper, where the fractured reservoirs are modeled by the dual porosity method. An efficient numerical scheme, including the finite difference (volume) method, CPR-FPF preconditioners for linear systems and effective decoupling methods, is presented. Parallel computing techniques employed in simulation of the two-phase flow are also presented. Using these numerical scheme and parallel techniques, a parallel reservoir simulator is developed, which is capable of simulating large-scale reservoir models. The numerical results show that this simulator is accurate and scalable compared to the commercial software and the numerical scheme is also effective.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125681776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable communication performance prediction using auto-generated pseudo MPI event trace","authors":"Miwako Tsuji, T. Boku, M. Sato","doi":"10.1145/3293320.3293323","DOIUrl":"https://doi.org/10.1145/3293320.3293323","url":null,"abstract":"For the co-design of HPC systems and applications, it is important to study how application performance is affected by the characteristics of the future systems, not just on a computation node but also for the parallel processing including inter-node communications. Trace-driven network simulators have been widely used because of its simplicity. However, they require the trace files corresponding to the simulated system size. Therefore, if a future system is larger than a current system, we can not adopt the trace files directly; that is, it is difficult to simulate a system larger than the current system. In order to address the scaling problem in the trace-driven network simulation, we have proposed a method called SCAlable Mpi Profiler (SCAMP). The SCAMP method runs an application on a current system, obtains MPI-event trace files, copies and edits the real trace files to create a large amount of pseudo MPI-event trace files for a future system, and finally drives a network simulator by inputting the pseudo MPI-event trace files. We also implemented a pseudo MPI-event trace file generator based on the analysis of LLVM's intermediate representations. We aim to easily obtain a first-order approximation of the communication performances for various network configurations and applications. In this paper, we describe the SCAMP system design and implementation as well as several performance evaluation results.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130334610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems","authors":"David Cardwell, Fengguang Song","doi":"10.1145/3293320.3293321","DOIUrl":"https://doi.org/10.1145/3293320.3293321","url":null,"abstract":"Performance modeling of parallel applications on distributed memory systems is a challenging task due to the effects of CPU speed, memory access time, and communication cost. In this paper, we propose a simple and intuitive graphical model, which extends the widely used Roofline performance model to include the communication cost in addition to the memory access time and the peak CPU performance. This new performance model inherits the simplicity of the original Roofline model and enables performance evaluation on a third dimension of communication performance. Such a model will greatly facilitate and expedite the analysis, development and optimization of parallel programs on high-end computer systems. We empirically validate the extended new Roofline model usingfl oating-point-computation-bound, memory-bound, and communication-bound applications. Three distinct high-end computing platforms have been tested: 1) high performance computing (HPC) systems, 2) high throughput computing systems, and 3) cloud computing systems. Our experimental results with four different parallel applications show that the new model can approximately evaluate the performance of different programs on various distributed-memory systems. Furthermore, the extended new model is able to provide insight into how the problem size can affect the upper bound performance of parallel applications, which is a special property revealed by the new dimension of communication cost analysis.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128244359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-efficient implementation and batching of tridiagonalization on manycore CPUs","authors":"Shuhei Kudo, Toshiyuki Imamura","doi":"10.1145/3293320.3293329","DOIUrl":"https://doi.org/10.1145/3293320.3293329","url":null,"abstract":"We herein propose an efficient implementation of tridiagonalization (TRD) for small matrices on manycore CPUs. Tridiagonalization is a matrix decomposition that is used as a preprocessor for eigenvalue computations. Further, TRD for such small matrices appears even in the HPC environment as a subproblem of large computations. To utilize the large cache memory of recent manycore CPUs, we reconstructed all parts of the implementation by introducing a systematic code generator to achieve performance portability and future extensibility. The flexibility of the system allows us to incorporate the \"BLAS+X\" approach, thereby improving the data reusability of the TRD algorithm and batching. The performance results indicate that our system outperforms the library implementations of TRD nearly twofold (or more for small matrices), on three different manycore CPUs: Fujitsu SPARC64, Intel Xeon, and Xeon Phi. As an extension, we also implemented the batching execution of TRD with a cache-aware scheduler on the top of our system. It not only doubles the peak performance at small matrices of n = O(100), but also improves it significantly up to n = O(1, 000), which is our target.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128911574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","authors":"","doi":"10.1145/3293320","DOIUrl":"https://doi.org/10.1145/3293320","url":null,"abstract":"","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121366807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512","authors":"Raehyun Kim, Jaeyoung Choi, Myungho Lee","doi":"10.1145/3293320.3293334","DOIUrl":"https://doi.org/10.1145/3293320.3293334","url":null,"abstract":"This paper presents the optimal implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi Processor code-named Knights Landing (KNL) and the Intel Xeon Scalable Processors based on an auto-tuning approach with the Intel AVX-512 intrinsic functions. Our auto-tuning approach precisely determines the parameters reflecting the target architectural features. Our approach significantly reduces the search space and derives optimal parameter sets including the size of submatrices, prefetch distances, loop unrolling depth, and parallelization scheme. Without a single line of assembly code, our GEMM kernels show the comparable performance results to the Intel MKL and outperform other open-source BLAS libraries.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121919049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed and Parallel Programming Paradigms on the K computer and a Cluster","authors":"Jérôme Gurhem, Miwako Tsuji, S. Petiton, M. Sato","doi":"10.1145/3293320.3293330","DOIUrl":"https://doi.org/10.1145/3293320.3293330","url":null,"abstract":"In this paper, we focus on a distributed and parallel programming paradigm for massively multicore supercomputers. We introduce YML, a development and execution environment for parallel and distributed applications based on a graph of task components scheduled at runtime and optimized for several middlewares. Then we show why YML may be well adapted to applications running on a lot of cores. The tasks are developed with the PGAS language XMP based on directives. We use YML/XMP to implement the block-wise Gaussian elimination to solve linear systems. We also implemented it with XMP and MPI without blocks. ScaLAPACK was also used to created an non-block implementation of the resolution of a dense linear system through LU factorization. Furthermore, we run it with different amount of blocks and number of processes per task. We find out that a good compromise between the number of blocks and the number of processes per task gives interesting results. YML/XMP obtains results faster than XMP on the K computer and close to XMP, MPI and ScaLAPACK on clusters of CPUs. We conclude that parallel and distributed multilevel programming paradigms like YML/XMP may be interesting solutions for extreme scale computing.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122368470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Portability of CPU-Accelerated Applications via Automated Source-to-Source Translation","authors":"P. Sathre, M. Gardner, Wu-chun Feng","doi":"10.1145/3293320.3293338","DOIUrl":"https://doi.org/10.1145/3293320.3293338","url":null,"abstract":"Over the past decade, accelerator-based supercomputers have grown from 0% to 42% performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be \"write once, run anywhere,\" regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due to the sheer volume of code implemented in non-portable languages. For example, the tremendous success of CUDA, as evidenced by the vast cornucopia of CUDA-accelerated applications, makes it infeasible to manually rewrite all these applications to achieve portability. Consequently, we achieve portability by using our automated CUDA-to-OpenCL source-to-source translator called CU2CL. To demonstrate the state of the practice, we use CU2CL to automatically translate three medium-to-large, CUDA-optimized codes to OpenCL, thus enabling the codes to run on other GPU-accelerated systems (as well as CPU- or FPGA-based systems). These automatically translated codes deliver performance portability, including as much as three-fold performance improvement, on a GPU device not supported by CUDA.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125815811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}