{"title":"A Memory Saving Communication Method Using Remote Atomic Operations","authors":"Masaaki Fushimi, Takahiro Kawashima, Takafumi Nose, Nobutaka Ihara, S. Sumimoto, Naoyuki Shida","doi":"10.1145/3293320.3293328","DOIUrl":"https://doi.org/10.1145/3293320.3293328","url":null,"abstract":"The MPI library for the K computer introduced a memory saving protocol. However, the protocol still requires memory in proportion to the number of MPI processes and a memory shortage can occur when the number of processes reaches millions or tens of millions. In order to solve the problem, we propose the shared receive buffer method which is a new communication protocol using remote atomic operations. This method is easily implemented if an interconnect has remote memory access and remote atomic memory operation. We implemented shared receive buffer method on PRIMEHPC FX100 system and evaluated. The per process memory usage of the proposed method is about one tenth compared to that of existing method.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131902626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Lightweight Method for Handling Control Divergence in GPGPUs","authors":"YaoHua Yang, Shiqing Zhang, Li Shen","doi":"10.1145/3293320.3293331","DOIUrl":"https://doi.org/10.1145/3293320.3293331","url":null,"abstract":"At present, graphics processing units (GPUs) has been widely used for scientific and high performance acceleration in the general purpose computing area, which is inseparable from the SIMT (Single-Instruction, Multiple-Thread) execution model. With SIMT, GPUs can fully utilize the advantages of SIMD parallel computing. However, when threads in a warp do not follow the same execution path, control divergence generates and affects the hardware utilization. In response to this problem, warp regrouping method has been proposed to combine threads executing the same branch path, which can significantly improve thread-level parallelism. But it is found that not all warps can be regrouped effectively because that may introduce a lot of unnecessary overheads, limiting further performance improvement. In this paper, we analyze the source of overheads and propose a lightweight warp regrouping method --- Partial Warp Regrouping (PWR) that controls the scope of reorganization and avoids most of the unnecessary warp regrouping by setting thresholds. In this method, it also can reduce the complexity of hardware design. Our experimental results show that this mechanism can improve the performance by 12% on average and up to 27% compared with immediate post-dominator.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125733301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An investigation into the impact of the structured QR kernel on the overall performance of the TSQR algorithm","authors":"Takeshi Fukaya","doi":"10.1145/3293320.3293327","DOIUrl":"https://doi.org/10.1145/3293320.3293327","url":null,"abstract":"The TSQR algorithm is a communication-avoiding algorithm for computing the QR factorization of a tall and skinny (TS) matrix. The TSQR algorithm entails repeatedly executing a kernel that computes the QR factorization of a structured matrix. Although a single execution of structured QR requires small computational cost, it is repeated depending on the number of active parallel processes. The complicated computational pattern and small matrix size of structured QR are obstacles to achieving high performance. Thus, the computational cost of structured QR becomes a significant bottleneck in massively parallel computation. In this paper, we focus on the kernel of structured QR and discuss its implementation. We compare several kernels including those provided in LAPACK on modern processors, and investigate the impact of the different structured QR kernels on the overall performance of the TSQR algorithm.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132533210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Collective MPI-IO Using Topology-Aware Stepwise Data Aggregation with I/O Throttling","authors":"Y. Tsujita, A. Hori, Toyohisa Kameyama, Atsuya Uno, F. Shoji, Y. Ishikawa","doi":"10.1145/3149457.3149464","DOIUrl":"https://doi.org/10.1145/3149457.3149464","url":null,"abstract":"MPI-IO has been used in an internal I/O interface layer of HDF5 or PnetCDF, where collective MPI-IO plays a big role in parallel I/O to manage a huge scale of scientific data. However, existing collective MPI-IO optimization named two-phase I/O has not been tuned enough for recent supercomputers consisting of mesh/torus interconnects and a huge scale of parallel file systems due to lack of topology-awareness in data transfers and optimization for parallel file systems. In this paper, we propose I/O throttling and topology-aware stepwise data aggregation in two-phase I/O of ROMIO, which is a representative MPI-IO library, in order to improve collective MPI-IO performance even if we have multiple processes per compute node. Throttling I/O requests going to a target file system mitigates I/O request contention, and consequently I/O performance improvements are achieved in file access phase of two-phase I/O. Topology-aware aggregator layout with paying attention to multiple aggregators per compute node alleviates contention in data aggregation phase of two-phase I/O. In addition, stepwise data aggregation improves data aggregation performance. HPIO benchmark results on the K computer indicate that the proposed optimization has achieved up to about 73% and 39% improvements in write performance compared with the original implementation using 12,288 and 24,576 processes on 3,072 and 6,144 compute nodes, respectively.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115670846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelized Software Offloading of Low-Level Communication with User-Level Threads","authors":"Wataru Endo, K. Taura","doi":"10.1145/3149457.3149475","DOIUrl":"https://doi.org/10.1145/3149457.3149475","url":null,"abstract":"Although recent HPC interconnects are assumed to achieve low latency and high bandwidth communication, in practical terms, their performance is often bounded by the network software stacks rather than the underlying hardware because message processing requires a certain amount of computation in CPUs. To exploit the hardware capacity, some existing communication libraries provide an interface for parallelizing accesses to network endpoints with manual hints. However, with growing core counts per node in modern clusters, it is increasingly difficult for users to efficiently handle communication resources in multi-threading environments. We implemented a low-level communication library that can automatically schedule communication requests by offloading them to multiple dedicated threads via lockless circular buffers. To enhance the efficiency of offloading, we developed a novel technique to dynamically change the number of offloading threads using a user-level thread library. We demonstrate that our offloading architecture exhibits better performance characteristics in microbenchmark results than the existing approaches.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"2015 29","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120970157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massively Parallel Method of Characteristics Neutron Transport Calculation with Anisotropic Scattering Treatment on GPUs","authors":"Namjae Choi, Junsuk Kang, H. Joo","doi":"10.1145/3149457.3149460","DOIUrl":"https://doi.org/10.1145/3149457.3149460","url":null,"abstract":"Even for the significant advances in CPU computing power and high performance computing, direct whole-core neutron transport calculation still remains unfeasible for the industrial applications. Furthermore, the improving trend of CPU technology is being challenged nowadays by thermal and power constraints. Thus, heterogeneous computing is increasingly receiving attention as an alternative for reactor physics. This work suggests a method to accelerate method of characteristics neutron transport calculation with anisotropic scattering treatment on GPUs. The method was implemented in nTRACER, a direct whole-core neutron transport calculation code being developed by Seoul National University. Performance results on VERA benchmark problem #5 P1 and P2 calculation presented 10-13 times speedup on GPU with adequate support of CPU compared to original CPU solver with 16-core parallel calculation. It was demonstrated that even an entry-level commercial GPU can be used as an effective means of reactor physics analysis if CPU -- GPU concurrency and single precision are properly utilized.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"4 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115905730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximizing Communication Overlap with Dynamic Program Analysis","authors":"Emmanuelle Saillard, Koushik Sen, W. Lavrijsen, Costin Iancu","doi":"10.1145/3149457.3149459","DOIUrl":"https://doi.org/10.1145/3149457.3149459","url":null,"abstract":"We present a dynamic program analysis approach to optimize communication overlap in scientific applications. Our tool instruments the code to generate a trace of the application's memory and synchronization behavior. An offline analysis determines the program optimal points for maximal overlap when considering several programming constructs: nonblocking one-sided communication operations, non-blocking collectives and bespoke synchronization patterns and operations. Feedback about possible transformations is presented to the user and the tool can perform the directed transformations, which are supported by a lightweight runtime. The value of our approach comes from: 1) the ability to optimize across boundaries of software modules or libraries, while specializing for the intrinsics of the underlying communication runtime; and 2) providing upper bounds on the expected performance improvements after communication optimizations. We have reduced the time spent in communication by as much as 64% for several applications that were already aggressively optimized for overlap; this indicates that manual optimizations leave untapped performance. Although demonstrated mainly for the UPC programming language, the methodology can be easily adapted to any other communication and synchronization API.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114756608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LRUM: Local Reliability Protocol for Unreliable Hardware Multicast","authors":"Hoang-Vu Dang, Brian Smith, R. Graham, G. Shainer","doi":"10.1145/3149457.3149467","DOIUrl":"https://doi.org/10.1145/3149457.3149467","url":null,"abstract":"This paper describes two new Message Passing Interface (MPI) broadcast algorithms who's performance is essentially independent of communicator size. These are based on using the InfiniBand unreliable datagram (UD) hardware multicast capabilities, with a latency which is very close to that of the MPI ping-pong point-to-point latency between the root and the furthest away process in the communicator. These algorithms rely on a new scale-independent local reliability protocol that guarantees destination buffer availability under load imbalance. Performance is compared to that of HPC-X/Open MPI, MVAPICH and IntelMPI. The new algorithms provide the best available latency across the board. At 128 processes the new algorithms are 2.3 times better at four megabytes, 5% better at four kilobytes, and provide comparable performance at eight byte broadcasts when compared to the next best broadcast implementation. The new algorithms also demonstrate the lowest streaming latency and highest broadcast throughput.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129538782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Solution of Sparse Linear Least Squares using LU Factorization","authors":"G. Howell, M. Baboulin","doi":"10.1145/3149457.3149462","DOIUrl":"https://doi.org/10.1145/3149457.3149462","url":null,"abstract":"In this paper, we are interested in computing the solution of an overdetermined sparse linear least squares problem Ax=b via the normal equations method. Transforming the normal equations using the L factor from a rectangular LU decomposition of A usually leads to a better conditioned problem. Here we explore a further preconditioning by inv(L1) where L1 is the n × n upper part of the lower trapezoidal m × n factor L. Since the condition number of the iteration matrix can be easily bounded, we can determine whether the iteration will be effective, and whether further pre-conditioning is required. Numerical experiments are performed with the Julia programming language. When the upper triangular matrix U has no near zero diagonal elements, the algorithm is observed to be reliable. When A has only a few more rows than columns, convergence requires relatively few iterations and the algorithm usually requires less storage than the Cholesky factor of AtA or the R factor of the QR factorization of A.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115304712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dynamic Parallel Strategy for DOACROSS Loops","authors":"Yuanzhen Cui, Song Liu, Nianjun Zou, Weiguo Wu","doi":"10.1145/3149457.3149469","DOIUrl":"https://doi.org/10.1145/3149457.3149469","url":null,"abstract":"Many parallelization methods work on exposing the pipeline/wave-front parallelism of DOACROSS loops through loop transformations. However, these methods statically assign iterations to available threads for parallel execution, and thus causing the waste of computing resources in synchronization among threads, especially in a multithreading environment. This paper proposes a brand-new parallel strategy that achieves wave-front parallelism with reduced dependences and provides dynamic tile assignment for DOACROSS loops, which has better ability to avoid threads from waiting in synchronization and utilize computing resources. The experimental results demonstrate that the proposed strategy outperforms two advanced strategies which are based on implicit barriers and POST/WAIT operations over six benchmarks on a multi-core server. The strategy also has better scalability for the increasing number of threads.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125939254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}