Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region: Latest Articles

Adaptive Level Binning: A New Algorithm for Solving Sparse Triangular Systems
Buse Yilmaz, Buğra Sipahioğlu, Najeeb Ahmad, D. Unat
DOI: https://doi.org/10.1145/3368474.3368486
Published: 2020-01-15
Abstract: Sparse triangular solve (SpTRSV) is an important scientific kernel used in several applications such as preconditioners for Krylov methods. Parallelizing SpTRSV on multi-core systems is challenging since it exhibits limited parallelism due to computational dependencies and introduces high parallelization overhead due to the fine-grained and unbalanced nature of workloads. We propose a novel method, named Adaptive Level Binning (ALB), that addresses these challenges by eliminating redundant synchronization points and adapting the work granularity with an efficient load balancing strategy. Similar to the commonly used level-set methods for solving SpTRSV, ALB constructs level-sets of rows, where each level can be computed in parallel. Unlike those methods, ALB bins rows to levels adaptively and reduces redundant dependencies between rows. On an Intel® Xeon® Gold 6148 processor and NVIDIA® Tesla V100 GPU, ALB obtains a 1.83x speedup on average and up to a 5.28x speedup over Intel MKL and, over NVIDIA cuSPARSE, an average speedup of 2.80x and a maximum speedup of 39.40x for 29 matrices selected from the SuiteSparse Matrix Collection.
Citations: 10
A Scalable Matrix-Free Iterative Eigensolver for Studying Many-Body Localization
R. Beeumen, Gregory D. Kahanamoku-Meyer, N. Yao, Chao Yang
DOI: https://doi.org/10.1145/3368474.3368497
Published: 2020-01-15
Abstract: We present a scalable and matrix-free eigensolver for studying two-level quantum spin chain models with nearest-neighbor XX + YY interactions plus Z terms. In particular, we focus on the Heisenberg interaction plus random on-site fields, a model that is commonly used to study the many-body localization (MBL) transition. This type of problem is computationally challenging because the vector space dimension grows exponentially with the physical system size, and the solve must be iterated many times to average over different configurations of the random disorder. For each eigenvalue problem, eigenvalues from different regions of the spectrum and their corresponding eigenvectors need to be computed. Traditionally, the interior eigenstates for a single eigenvalue problem are computed via the shift-and-invert Lanczos algorithm. Due to the extremely high memory footprint of the LU factorizations, this technique is not well suited for a large number of spins L, e.g., one needs thousands of compute nodes on modern high performance computing infrastructures to go beyond L = 24. The new matrix-free approach proposed in this paper does not suffer from this memory bottleneck and even allows for simulating spin chains of up to L = 24 spins on a single compute node. We discuss the OpenMP and hybrid MPI-OpenMP implementations of matrix-free block matrix-vector operations that are the key components of the new approach. The efficiency and effectiveness of the proposed algorithm are demonstrated by computing eigenstates in a massively parallel fashion and analyzing their entanglement entropy to gain insight into the MBL transition.
Citations: 7
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
DOI: https://doi.org/10.1145/3368474
Published: 2020-01-15
Citations: 0
Wavelength-routing interconnect "Optical Hub" for parallel computing systems
Y. Urino, K. Mizutani, Tatsuya Usuki, S. Nakamura
DOI: https://doi.org/10.1145/3368474.3368495
Published: 2020-01-15
Abstract: To solve the inter-node bandwidth bottleneck in parallel computing systems, we propose a wavelength-routing inter-node interconnect, "Optical Hub". The physical topology of Optical Hub is a star network, which leads to advantages in terms of throughput, size, energy consumption and lifetime cost. The logical topology is a full-mesh network, which leads to advantages in terms of latency and reliability. We introduce multi-path routing, which expands the effective bandwidth of a full-mesh topology such as Optical Hub, by replacing conventional MPI functions with our wrapper functions. We simulated the execution time of parallel benchmarks on a parallel computing system with Optical Hub using the parallel computing simulator SimGrid. The results confirm that a parallel computing system with Optical Hub can achieve higher performance and lower energy consumption than conventional ones. We also examined the scalability of Optical Hub and showed that recursive hierarchical configurations of Optical Hub can drastically reduce the cable count compared with Dragonfly networks when the number of nodes is large.
Citations: 2
Towards GPU Acceleration of Phonon Computation with ShengBTE
Yiming Wei, Xin You, Hailong Yang, Zhongzhi Luan, D. Qian
DOI: https://doi.org/10.1145/3368474.3368487
Published: 2020-01-15
Abstract: ShengBTE is one of the software packages commonly used in the field of phonon computation (e.g., to determine the lattice thermal conductivity). ShengBTE simulates phonon diffusion by solving the Boltzmann transport equations, which take a long time to solve because of their high computational complexity. This paper focuses on the performance optimization of ShengBTE on GPU. We identify the performance bottlenecks of ShengBTE and propose corresponding optimizations such as loop-carried dependency elimination, hotspot function acceleration on GPU and performance tuning on thread blocks. The experimental results show that the proposed optimizations significantly improve the performance of ShengBTE, achieving average speedups of 9.06x and 13.74x on discrete temperature simulation and continuous temperature simulation, respectively, without losing accuracy.
Citations: 2
FFTE on SVE: SPIRAL-Generated Kernels
D. Takahashi, F. Franchetti
DOI: https://doi.org/10.1145/3368474.3368488
Published: 2020-01-15
Abstract: In this paper we propose an implementation of the fast Fourier transform (FFT) targeting the ARM Scalable Vector Extension (SVE). For FFT kernels, we performed automatic vectorization via a compiler and explicit vectorization through code generation by SPIRAL, and compared the performance. We show that the explicit vectorization of SPIRAL-generated code improves performance significantly. Performance results of FFTs on RIKEN's Fugaku processor simulator are reported. With the ARM compiler, SPIRAL-generated FFT kernels written in SVE intrinsics are up to 3.16 times faster than FFT kernels of FFTE written in Fortran and up to 5.62 times faster than SPIRAL-generated FFT kernels written in C.
Citations: 3
Integrating Cache Oblivious Approach with Modern Processor Architecture: The Case of Floyd-Warshall Algorithm
Toshio Endo
DOI: https://doi.org/10.1145/3368474.3368477
Published: 2020-01-15
Abstract: For implementing algorithms on processors with deep cache hierarchies, the cache-oblivious approach, which is based on recursive divide and conquer, is considered promising. This paper focuses on a single-node implementation of the Floyd-Warshall (FW) algorithm, which is an important graph computation kernel. For higher performance, SIMD instructions, another facility of modern processors, need to be integrated into the recursive approach efficiently. This paper describes a methodology for constructing recursive implementations that take SIMD and multi-core architectures into account while exploiting the cache. The experiments show that our FW implementation achieves around 1.1 TFlops on a dual-socket Skylake machine and 700 GFlops on a Xeon Phi machine, both of which support the AVX-512 SIMD ISA.
Citations: 1
Scalable Direct-Iterative Hybrid Solver for Sparse Matrices on Multi-Core and Vector Architectures
K. Ono, Toshihiro Kato, S. Ohshima, T. Nanri
DOI: https://doi.org/10.1145/3368474.3368484
Published: 2020-01-15
Abstract: In the present paper, we propose an efficient direct-iterative hybrid solver for sparse matrices that can exploit the scalability of the latest multi-core, many-core, and vector architectures, and we examine the execution performance of the proposed SLOR-PCR method. We also present an efficient implementation of the PCR algorithm for SIMD and vector architectures, written so that the compiler can easily generate optimized instructions. The proposed hybrid method has high cache reusability, which is favorable for modern low-B/F architectures because efficient use of the cache can mitigate the memory bandwidth limitation. The measured performance revealed that the SLOR-PCR solver showed excellent scalability up to 352 cores in the cc-NUMA environment, and the achieved performance was higher than that of the conventional Jacobi and Red-Black ordering methods by a factor of 3.6 to 8.3 on the SIMD architecture. In addition, the maximum speedup in computation time was observed to be a factor of 6.3 on the cc-NUMA architecture with 352 cores.
Citations: 2
On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency
Christian Helm, K. Taura
DOI: https://doi.org/10.1145/3368474.3368476
Published: 2020-01-15
Abstract: Diagnosing whether an application suffers from DRAM contention can be a challenging task. One method is to compare the hardware memory bandwidth limit with the measured memory bandwidth of an application. Another method is based on memory access latency. The latency of a DRAM access in an uncontended state is a hardware characteristic. If an application shows higher DRAM access latency, the increase comes from queuing delays and the application is limited by DRAM bandwidth. Hardware-based measurement of the application's latency and bandwidth can be done with low overhead and is agnostic of the application's implementation. But the practical implementation of such a diagnosis system on CPUs is difficult. Modern CPUs offer an abundance of performance counters with only superficial documentation. Different types of counters for bandwidth or latency that seemingly measure the same thing produce different results. There is no in-depth understanding of those performance counters, and naive usage may lead to incorrect measurements. Because there is no hardware feature to measure DRAM access latency directly, the implementation of the above-mentioned latency-based method may seem impossible. In this paper, we compare various hardware latency and bandwidth measurement methods on CPUs by using micro-benchmarks. We show results for Intel Haswell, Broadwell and Skylake systems. With our experiments, we show how and why performance counters for bandwidth and latency differ. Only the counters inside the memory controller measure bandwidth correctly. Latency measured by instruction sampling is suitable for finding DRAM contention, even though it is not a pure DRAM access latency.
Citations: 2
Quantum Dynamics at Scale: Ultrafast Control of Emergent Functional Materials
S. Tiwari, A. Krishnamoorthy, P. Rajak, Putt Sakdhnagool, Manaschai Kunaseth, F. Shimojo, S. Fukushima, A. Nakano, Ye Luo, R. Kalia, K. Nomura, P. Vashishta
DOI: https://doi.org/10.1145/3368474.3368489
Published: 2020-01-15
Abstract: The confluence of extreme-scale quantum dynamics simulations (i.e., quantum@scale) and cutting-edge x-ray free-electron laser experiments is revolutionizing materials science. An archetypal example is the exciting concept of using picosecond light pulses to control emergent material properties on demand in atomically thin layered materials. This paper describes efforts to scale our quantum molecular dynamics engine toward the United States' first exaflop/s computer, under an Aurora Early Science Program project named "Metascalable layered material genome". Key algorithmic and computing techniques incorporated are: (1) globally-scalable and locally-fast solvers within a linear-scaling divide-conquer-recombine algorithmic framework; (2) algebraic 'BLASification' of computational kernels; and (3) data alignment and loop restructuring, along with register and cache blocking, for enhanced vectorization and efficient memory access. The resulting weak-scaling parallel efficiency was 0.93 on 131,072 Intel Xeon Phi cores for a 56.6 million atom (or 169 million valence-electron) system, whereas the various code transformations achieved a 5-fold speedup. The optimized simulation engine allowed us, for the first time, to establish a significant effect of the substrate on the dynamics of layered material upon electronic excitation.
Citations: 0