Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region: Latest Articles

A Memory Saving Communication Method Using Remote Atomic Operations
Masaaki Fushimi, Takahiro Kawashima, Takafumi Nose, Nobutaka Ihara, S. Sumimoto, Naoyuki Shida
{"title":"A Memory Saving Communication Method Using Remote Atomic Operations","authors":"Masaaki Fushimi, Takahiro Kawashima, Takafumi Nose, Nobutaka Ihara, S. Sumimoto, Naoyuki Shida","doi":"10.1145/3293320.3293328","DOIUrl":"https://doi.org/10.1145/3293320.3293328","url":null,"abstract":"The MPI library for the K computer introduced a memory saving protocol. However, the protocol still requires memory in proportion to the number of MPI processes and a memory shortage can occur when the number of processes reaches millions or tens of millions. In order to solve the problem, we propose the shared receive buffer method which is a new communication protocol using remote atomic operations. This method is easily implemented if an interconnect has remote memory access and remote atomic memory operation. We implemented shared receive buffer method on PRIMEHPC FX100 system and evaluated. The per process memory usage of the proposed method is about one tenth compared to that of existing method.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131902626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Lightweight Method for Handling Control Divergence in GPGPUs
YaoHua Yang, Shiqing Zhang, Li Shen
{"title":"A Lightweight Method for Handling Control Divergence in GPGPUs","authors":"YaoHua Yang, Shiqing Zhang, Li Shen","doi":"10.1145/3293320.3293331","DOIUrl":"https://doi.org/10.1145/3293320.3293331","url":null,"abstract":"At present, graphics processing units (GPUs) have been widely used for scientific and high performance acceleration in the general purpose computing area, which is inseparable from the SIMT (Single-Instruction, Multiple-Thread) execution model. With SIMT, GPUs can fully utilize the advantages of SIMD parallel computing. However, when threads in a warp do not follow the same execution path, control divergence generates and affects the hardware utilization. In response to this problem, warp regrouping method has been proposed to combine threads executing the same branch path, which can significantly improve thread-level parallelism. But it is found that not all warps can be regrouped effectively because that may introduce a lot of unnecessary overheads, limiting further performance improvement. In this paper, we analyze the source of overheads and propose a lightweight warp regrouping method --- Partial Warp Regrouping (PWR) that controls the scope of reorganization and avoids most of the unnecessary warp regrouping by setting thresholds. In this method, it also can reduce the complexity of hardware design. Our experimental results show that this mechanism can improve the performance by 12% on average and up to 27% compared with immediate post-dominator.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125733301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
An investigation into the impact of the structured QR kernel on the overall performance of the TSQR algorithm
Takeshi Fukaya
{"title":"An investigation into the impact of the structured QR kernel on the overall performance of the TSQR algorithm","authors":"Takeshi Fukaya","doi":"10.1145/3293320.3293327","DOIUrl":"https://doi.org/10.1145/3293320.3293327","url":null,"abstract":"The TSQR algorithm is a communication-avoiding algorithm for computing the QR factorization of a tall and skinny (TS) matrix. The TSQR algorithm entails repeatedly executing a kernel that computes the QR factorization of a structured matrix. Although a single execution of structured QR requires small computational cost, it is repeated depending on the number of active parallel processes. The complicated computational pattern and small matrix size of structured QR are obstacles to achieving high performance. Thus, the computational cost of structured QR becomes a significant bottleneck in massively parallel computation. In this paper, we focus on the kernel of structured QR and discuss its implementation. We compare several kernels including those provided in LAPACK on modern processors, and investigate the impact of the different structured QR kernels on the overall performance of the TSQR algorithm.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132533210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving Collective MPI-IO Using Topology-Aware Stepwise Data Aggregation with I/O Throttling
Y. Tsujita, A. Hori, Toyohisa Kameyama, Atsuya Uno, F. Shoji, Y. Ishikawa
{"title":"Improving Collective MPI-IO Using Topology-Aware Stepwise Data Aggregation with I/O Throttling","authors":"Y. Tsujita, A. Hori, Toyohisa Kameyama, Atsuya Uno, F. Shoji, Y. Ishikawa","doi":"10.1145/3149457.3149464","DOIUrl":"https://doi.org/10.1145/3149457.3149464","url":null,"abstract":"MPI-IO has been used in an internal I/O interface layer of HDF5 or PnetCDF, where collective MPI-IO plays a big role in parallel I/O to manage a huge scale of scientific data. However, existing collective MPI-IO optimization named two-phase I/O has not been tuned enough for recent supercomputers consisting of mesh/torus interconnects and a huge scale of parallel file systems due to lack of topology-awareness in data transfers and optimization for parallel file systems. In this paper, we propose I/O throttling and topology-aware stepwise data aggregation in two-phase I/O of ROMIO, which is a representative MPI-IO library, in order to improve collective MPI-IO performance even if we have multiple processes per compute node. Throttling I/O requests going to a target file system mitigates I/O request contention, and consequently I/O performance improvements are achieved in file access phase of two-phase I/O. Topology-aware aggregator layout with paying attention to multiple aggregators per compute node alleviates contention in data aggregation phase of two-phase I/O. In addition, stepwise data aggregation improves data aggregation performance. HPIO benchmark results on the K computer indicate that the proposed optimization has achieved up to about 73% and 39% improvements in write performance compared with the original implementation using 12,288 and 24,576 processes on 3,072 and 6,144 compute nodes, respectively.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115670846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Parallelized Software Offloading of Low-Level Communication with User-Level Threads
Wataru Endo, K. Taura
{"title":"Parallelized Software Offloading of Low-Level Communication with User-Level Threads","authors":"Wataru Endo, K. Taura","doi":"10.1145/3149457.3149475","DOIUrl":"https://doi.org/10.1145/3149457.3149475","url":null,"abstract":"Although recent HPC interconnects are assumed to achieve low latency and high bandwidth communication, in practical terms, their performance is often bounded by the network software stacks rather than the underlying hardware because message processing requires a certain amount of computation in CPUs. To exploit the hardware capacity, some existing communication libraries provide an interface for parallelizing accesses to network endpoints with manual hints. However, with growing core counts per node in modern clusters, it is increasingly difficult for users to efficiently handle communication resources in multi-threading environments. We implemented a low-level communication library that can automatically schedule communication requests by offloading them to multiple dedicated threads via lockless circular buffers. To enhance the efficiency of offloading, we developed a novel technique to dynamically change the number of offloading threads using a user-level thread library. We demonstrate that our offloading architecture exhibits better performance characteristics in microbenchmark results than the existing approaches.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"2015 29","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120970157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Massively Parallel Method of Characteristics Neutron Transport Calculation with Anisotropic Scattering Treatment on GPUs
Namjae Choi, Junsuk Kang, H. Joo
{"title":"Massively Parallel Method of Characteristics Neutron Transport Calculation with Anisotropic Scattering Treatment on GPUs","authors":"Namjae Choi, Junsuk Kang, H. Joo","doi":"10.1145/3149457.3149460","DOIUrl":"https://doi.org/10.1145/3149457.3149460","url":null,"abstract":"Even for the significant advances in CPU computing power and high performance computing, direct whole-core neutron transport calculation still remains unfeasible for the industrial applications. Furthermore, the improving trend of CPU technology is being challenged nowadays by thermal and power constraints. Thus, heterogeneous computing is increasingly receiving attention as an alternative for reactor physics. This work suggests a method to accelerate method of characteristics neutron transport calculation with anisotropic scattering treatment on GPUs. The method was implemented in nTRACER, a direct whole-core neutron transport calculation code being developed by Seoul National University. Performance results on VERA benchmark problem #5 P1 and P2 calculation presented 10-13 times speedup on GPU with adequate support of CPU compared to original CPU solver with 16-core parallel calculation. It was demonstrated that even an entry-level commercial GPU can be used as an effective means of reactor physics analysis if CPU -- GPU concurrency and single precision are properly utilized.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"4 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115905730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Maximizing Communication Overlap with Dynamic Program Analysis
Emmanuelle Saillard, Koushik Sen, W. Lavrijsen, Costin Iancu
{"title":"Maximizing Communication Overlap with Dynamic Program Analysis","authors":"Emmanuelle Saillard, Koushik Sen, W. Lavrijsen, Costin Iancu","doi":"10.1145/3149457.3149459","DOIUrl":"https://doi.org/10.1145/3149457.3149459","url":null,"abstract":"We present a dynamic program analysis approach to optimize communication overlap in scientific applications. Our tool instruments the code to generate a trace of the application's memory and synchronization behavior. An offline analysis determines the program optimal points for maximal overlap when considering several programming constructs: nonblocking one-sided communication operations, non-blocking collectives and bespoke synchronization patterns and operations. Feedback about possible transformations is presented to the user and the tool can perform the directed transformations, which are supported by a lightweight runtime. The value of our approach comes from: 1) the ability to optimize across boundaries of software modules or libraries, while specializing for the intrinsics of the underlying communication runtime; and 2) providing upper bounds on the expected performance improvements after communication optimizations. We have reduced the time spent in communication by as much as 64% for several applications that were already aggressively optimized for overlap; this indicates that manual optimizations leave untapped performance. Although demonstrated mainly for the UPC programming language, the methodology can be easily adapted to any other communication and synchronization API.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114756608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
LRUM: Local Reliability Protocol for Unreliable Hardware Multicast
Hoang-Vu Dang, Brian Smith, R. Graham, G. Shainer
{"title":"LRUM: Local Reliability Protocol for Unreliable Hardware Multicast","authors":"Hoang-Vu Dang, Brian Smith, R. Graham, G. Shainer","doi":"10.1145/3149457.3149467","DOIUrl":"https://doi.org/10.1145/3149457.3149467","url":null,"abstract":"This paper describes two new Message Passing Interface (MPI) broadcast algorithms whose performance is essentially independent of communicator size. These are based on using the InfiniBand unreliable datagram (UD) hardware multicast capabilities, with a latency which is very close to that of the MPI ping-pong point-to-point latency between the root and the furthest away process in the communicator. These algorithms rely on a new scale-independent local reliability protocol that guarantees destination buffer availability under load imbalance. Performance is compared to that of HPC-X/Open MPI, MVAPICH and IntelMPI. The new algorithms provide the best available latency across the board. At 128 processes the new algorithms are 2.3 times better at four megabytes, 5% better at four kilobytes, and provide comparable performance at eight byte broadcasts when compared to the next best broadcast implementation. The new algorithms also demonstrate the lowest streaming latency and highest broadcast throughput.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129538782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Iterative Solution of Sparse Linear Least Squares using LU Factorization
G. Howell, M. Baboulin
{"title":"Iterative Solution of Sparse Linear Least Squares using LU Factorization","authors":"G. Howell, M. Baboulin","doi":"10.1145/3149457.3149462","DOIUrl":"https://doi.org/10.1145/3149457.3149462","url":null,"abstract":"In this paper, we are interested in computing the solution of an overdetermined sparse linear least squares problem Ax=b via the normal equations method. Transforming the normal equations using the L factor from a rectangular LU decomposition of A usually leads to a better conditioned problem. Here we explore a further preconditioning by inv(L1) where L1 is the n × n upper part of the lower trapezoidal m × n factor L. Since the condition number of the iteration matrix can be easily bounded, we can determine whether the iteration will be effective, and whether further pre-conditioning is required. Numerical experiments are performed with the Julia programming language. When the upper triangular matrix U has no near zero diagonal elements, the algorithm is observed to be reliable. When A has only a few more rows than columns, convergence requires relatively few iterations and the algorithm usually requires less storage than the Cholesky factor of AtA or the R factor of the QR factorization of A.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115304712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A Dynamic Parallel Strategy for DOACROSS Loops
Yuanzhen Cui, Song Liu, Nianjun Zou, Weiguo Wu
{"title":"A Dynamic Parallel Strategy for DOACROSS Loops","authors":"Yuanzhen Cui, Song Liu, Nianjun Zou, Weiguo Wu","doi":"10.1145/3149457.3149469","DOIUrl":"https://doi.org/10.1145/3149457.3149469","url":null,"abstract":"Many parallelization methods work on exposing the pipeline/wave-front parallelism of DOACROSS loops through loop transformations. However, these methods statically assign iterations to available threads for parallel execution, and thus causing the waste of computing resources in synchronization among threads, especially in a multithreading environment. This paper proposes a brand-new parallel strategy that achieves wave-front parallelism with reduced dependences and provides dynamic tile assignment for DOACROSS loops, which has better ability to avoid threads from waiting in synchronization and utilize computing resources. The experimental results demonstrate that the proposed strategy outperforms two advanced strategies which are based on implicit barriers and POST/WAIT operations over six benchmarks on a multi-core server. The strategy also has better scalability for the increasing number of threads.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125939254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2