{"title":"A Memory Saving Communication Method Using Remote Atomic Operations","authors":"Masaaki Fushimi, Takahiro Kawashima, Takafumi Nose, Nobutaka Ihara, S. Sumimoto, Naoyuki Shida","doi":"10.1145/3293320.3293328","DOIUrl":"https://doi.org/10.1145/3293320.3293328","url":null,"abstract":"The MPI library for the K computer introduced a memory saving protocol. However, the protocol still requires memory in proportion to the number of MPI processes and a memory shortage can occur when the number of processes reaches millions or tens of millions. In order to solve the problem, we propose the shared receive buffer method which is a new communication protocol using remote atomic operations. This method is easily implemented if an interconnect has remote memory access and remote atomic memory operation. We implemented shared receive buffer method on PRIMEHPC FX100 system and evaluated. The per process memory usage of the proposed method is about one tenth compared to that of existing method.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131902626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Lightweight Method for Handling Control Divergence in GPGPUs","authors":"YaoHua Yang, Shiqing Zhang, Li Shen","doi":"10.1145/3293320.3293331","DOIUrl":"https://doi.org/10.1145/3293320.3293331","url":null,"abstract":"At present, graphics processing units (GPUs) has been widely used for scientific and high performance acceleration in the general purpose computing area, which is inseparable from the SIMT (Single-Instruction, Multiple-Thread) execution model. With SIMT, GPUs can fully utilize the advantages of SIMD parallel computing. However, when threads in a warp do not follow the same execution path, control divergence generates and affects the hardware utilization. In response to this problem, warp regrouping method has been proposed to combine threads executing the same branch path, which can significantly improve thread-level parallelism. But it is found that not all warps can be regrouped effectively because that may introduce a lot of unnecessary overheads, limiting further performance improvement. In this paper, we analyze the source of overheads and propose a lightweight warp regrouping method --- Partial Warp Regrouping (PWR) that controls the scope of reorganization and avoids most of the unnecessary warp regrouping by setting thresholds. In this method, it also can reduce the complexity of hardware design. Our experimental results show that this mechanism can improve the performance by 12% on average and up to 27% compared with immediate post-dominator.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125733301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An investigation into the impact of the structured QR kernel on the overall performance of the TSQR algorithm","authors":"Takeshi Fukaya","doi":"10.1145/3293320.3293327","DOIUrl":"https://doi.org/10.1145/3293320.3293327","url":null,"abstract":"The TSQR algorithm is a communication-avoiding algorithm for computing the QR factorization of a tall and skinny (TS) matrix. The TSQR algorithm entails repeatedly executing a kernel that computes the QR factorization of a structured matrix. Although a single execution of structured QR requires small computational cost, it is repeated depending on the number of active parallel processes. The complicated computational pattern and small matrix size of structured QR are obstacles to achieving high performance. Thus, the computational cost of structured QR becomes a significant bottleneck in massively parallel computation. In this paper, we focus on the kernel of structured QR and discuss its implementation. We compare several kernels including those provided in LAPACK on modern processors, and investigate the impact of the different structured QR kernels on the overall performance of the TSQR algorithm.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132533210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Collective MPI-IO Using Topology-Aware Stepwise Data Aggregation with I/O Throttling","authors":"Y. Tsujita, A. Hori, Toyohisa Kameyama, Atsuya Uno, F. Shoji, Y. Ishikawa","doi":"10.1145/3149457.3149464","DOIUrl":"https://doi.org/10.1145/3149457.3149464","url":null,"abstract":"MPI-IO has been used in an internal I/O interface layer of HDF5 or PnetCDF, where collective MPI-IO plays a big role in parallel I/O to manage a huge scale of scientific data. However, existing collective MPI-IO optimization named two-phase I/O has not been tuned enough for recent supercomputers consisting of mesh/torus interconnects and a huge scale of parallel file systems due to lack of topology-awareness in data transfers and optimization for parallel file systems. In this paper, we propose I/O throttling and topology-aware stepwise data aggregation in two-phase I/O of ROMIO, which is a representative MPI-IO library, in order to improve collective MPI-IO performance even if we have multiple processes per compute node. Throttling I/O requests going to a target file system mitigates I/O request contention, and consequently I/O performance improvements are achieved in file access phase of two-phase I/O. Topology-aware aggregator layout with paying attention to multiple aggregators per compute node alleviates contention in data aggregation phase of two-phase I/O. In addition, stepwise data aggregation improves data aggregation performance. HPIO benchmark results on the K computer indicate that the proposed optimization has achieved up to about 73% and 39% improvements in write performance compared with the original implementation using 12,288 and 24,576 processes on 3,072 and 6,144 compute nodes, respectively.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115670846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelized Software Offloading of Low-Level Communication with User-Level Threads","authors":"Wataru Endo, K. Taura","doi":"10.1145/3149457.3149475","DOIUrl":"https://doi.org/10.1145/3149457.3149475","url":null,"abstract":"Although recent HPC interconnects are assumed to achieve low latency and high bandwidth communication, in practical terms, their performance is often bounded by the network software stacks rather than the underlying hardware because message processing requires a certain amount of computation in CPUs. To exploit the hardware capacity, some existing communication libraries provide an interface for parallelizing accesses to network endpoints with manual hints. However, with growing core counts per node in modern clusters, it is increasingly difficult for users to efficiently handle communication resources in multi-threading environments. We implemented a low-level communication library that can automatically schedule communication requests by offloading them to multiple dedicated threads via lockless circular buffers. To enhance the efficiency of offloading, we developed a novel technique to dynamically change the number of offloading threads using a user-level thread library. We demonstrate that our offloading architecture exhibits better performance characteristics in microbenchmark results than the existing approaches.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"2015 29","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120970157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massively Parallel Method of Characteristics Neutron Transport Calculation with Anisotropic Scattering Treatment on GPUs","authors":"Namjae Choi, Junsuk Kang, H. Joo","doi":"10.1145/3149457.3149460","DOIUrl":"https://doi.org/10.1145/3149457.3149460","url":null,"abstract":"Even for the significant advances in CPU computing power and high performance computing, direct whole-core neutron transport calculation still remains unfeasible for the industrial applications. Furthermore, the improving trend of CPU technology is being challenged nowadays by thermal and power constraints. Thus, heterogeneous computing is increasingly receiving attention as an alternative for reactor physics. This work suggests a method to accelerate method of characteristics neutron transport calculation with anisotropic scattering treatment on GPUs. The method was implemented in nTRACER, a direct whole-core neutron transport calculation code being developed by Seoul National University. Performance results on VERA benchmark problem #5 P1 and P2 calculation presented 10-13 times speedup on GPU with adequate support of CPU compared to original CPU solver with 16-core parallel calculation. It was demonstrated that even an entry-level commercial GPU can be used as an effective means of reactor physics analysis if CPU -- GPU concurrency and single precision are properly utilized.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"4 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115905730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximizing Communication Overlap with Dynamic Program Analysis","authors":"Emmanuelle Saillard, Koushik Sen, W. Lavrijsen, Costin Iancu","doi":"10.1145/3149457.3149459","DOIUrl":"https://doi.org/10.1145/3149457.3149459","url":null,"abstract":"We present a dynamic program analysis approach to optimize communication overlap in scientific applications. Our tool instruments the code to generate a trace of the application's memory and synchronization behavior. An offline analysis determines the program optimal points for maximal overlap when considering several programming constructs: nonblocking one-sided communication operations, non-blocking collectives and bespoke synchronization patterns and operations. Feedback about possible transformations is presented to the user and the tool can perform the directed transformations, which are supported by a lightweight runtime. The value of our approach comes from: 1) the ability to optimize across boundaries of software modules or libraries, while specializing for the intrinsics of the underlying communication runtime; and 2) providing upper bounds on the expected performance improvements after communication optimizations. We have reduced the time spent in communication by as much as 64% for several applications that were already aggressively optimized for overlap; this indicates that manual optimizations leave untapped performance. Although demonstrated mainly for the UPC programming language, the methodology can be easily adapted to any other communication and synchronization API.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114756608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LRUM: Local Reliability Protocol for Unreliable Hardware Multicast","authors":"Hoang-Vu Dang, Brian Smith, R. Graham, G. Shainer","doi":"10.1145/3149457.3149467","DOIUrl":"https://doi.org/10.1145/3149457.3149467","url":null,"abstract":"This paper describes two new Message Passing Interface (MPI) broadcast algorithms who's performance is essentially independent of communicator size. These are based on using the InfiniBand unreliable datagram (UD) hardware multicast capabilities, with a latency which is very close to that of the MPI ping-pong point-to-point latency between the root and the furthest away process in the communicator. These algorithms rely on a new scale-independent local reliability protocol that guarantees destination buffer availability under load imbalance. Performance is compared to that of HPC-X/Open MPI, MVAPICH and IntelMPI. The new algorithms provide the best available latency across the board. At 128 processes the new algorithms are 2.3 times better at four megabytes, 5% better at four kilobytes, and provide comparable performance at eight byte broadcasts when compared to the next best broadcast implementation. The new algorithms also demonstrate the lowest streaming latency and highest broadcast throughput.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129538782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Solution of Sparse Linear Least Squares using LU Factorization","authors":"G. Howell, M. Baboulin","doi":"10.1145/3149457.3149462","DOIUrl":"https://doi.org/10.1145/3149457.3149462","url":null,"abstract":"In this paper, we are interested in computing the solution of an overdetermined sparse linear least squares problem Ax=b via the normal equations method. Transforming the normal equations using the L factor from a rectangular LU decomposition of A usually leads to a better conditioned problem. Here we explore a further preconditioning by inv(L1) where L1 is the n × n upper part of the lower trapezoidal m × n factor L. Since the condition number of the iteration matrix can be easily bounded, we can determine whether the iteration will be effective, and whether further pre-conditioning is required. Numerical experiments are performed with the Julia programming language. When the upper triangular matrix U has no near zero diagonal elements, the algorithm is observed to be reliable. When A has only a few more rows than columns, convergence requires relatively few iterations and the algorithm usually requires less storage than the Cholesky factor of AtA or the R factor of the QR factorization of A.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115304712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dynamic Parallel Strategy for DOACROSS Loops","authors":"Yuanzhen Cui, Song Liu, Nianjun Zou, Weiguo Wu","doi":"10.1145/3149457.3149469","DOIUrl":"https://doi.org/10.1145/3149457.3149469","url":null,"abstract":"Many parallelization methods work on exposing the pipeline/wave-front parallelism of DOACROSS loops through loop transformations. However, these methods statically assign iterations to available threads for parallel execution, and thus causing the waste of computing resources in synchronization among threads, especially in a multithreading environment. This paper proposes a brand-new parallel strategy that achieves wave-front parallelism with reduced dependences and provides dynamic tile assignment for DOACROSS loops, which has better ability to avoid threads from waiting in synchronization and utilize computing resources. The experimental results demonstrate that the proposed strategy outperforms two advanced strategies which are based on implicit barriers and POST/WAIT operations over six benchmarks on a multi-core server. The strategy also has better scalability for the increasing number of threads.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125939254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}