Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region: Latest Papers, Page 5

Adaptive Level Binning: A New Algorithm for Solving Sparse Triangular Systems

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368486

Buse Yilmaz, Buğra Sipahioğlu, Najeeb Ahmad, D. Unat

Abstract: Sparse triangular solve (SpTRSV) is an important scientific kernel used in several applications, such as preconditioners for Krylov methods. Parallelizing SpTRSV on multi-core systems is challenging because it exhibits limited parallelism due to computational dependencies and incurs high parallelization overhead due to the fine-grained and unbalanced nature of its workloads. We propose a novel method, named Adaptive Level Binning (ALB), that addresses these challenges by eliminating redundant synchronization points and adapting the work granularity with an efficient load balancing strategy. Similar to the commonly used level-set methods for solving SpTRSV, ALB constructs level-sets of rows, where each level can be computed in parallel. Unlike those methods, ALB bins rows to levels adaptively and reduces redundant dependencies between rows. On an Intel® Xeon® Gold 6148 processor and an NVIDIA® Tesla V100 GPU, ALB obtains an average speedup of 1.83x and a maximum speedup of 5.28x over Intel MKL, and an average speedup of 2.80x and a maximum speedup of 39.40x over NVIDIA cuSPARSE, for 29 matrices selected from the SuiteSparse Matrix Collection.
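The level-set idea that the abstract takes as its starting point can be illustrated with a small serial sketch. This is the conventional level-set baseline, not ALB itself: the abstract does not describe the adaptive binning in enough detail to reproduce it. The row format (a list of `(column, value)` pairs per row) and the function names `build_levels` and `solve_lower` are assumptions made for this example.

```python
def build_levels(rows):
    """Group the rows of a lower-triangular matrix into level-sets.

    rows[i] is a list of (col, value) pairs for row i, diagonal included.
    A row is placed one level after the deepest row it depends on, so all
    rows in the same level can be solved independently of each other.
    """
    n = len(rows)
    level_of = [0] * n
    for i in range(n):
        deps = [level_of[j] for j, _ in rows[i] if j < i]
        level_of[i] = 1 + max(deps) if deps else 0
    levels = [[] for _ in range(max(level_of, default=-1) + 1)]
    for i, lvl in enumerate(level_of):
        levels[lvl].append(i)
    return levels


def solve_lower(rows, b):
    """Solve L x = b by forward substitution, one level at a time."""
    x = [0.0] * len(rows)
    for level in build_levels(rows):
        # Iterations of this loop are independent; a parallel solver would
        # distribute them and synchronize once per level.
        for i in level:
            acc = b[i]
            diag = None
            for j, v in rows[i]:
                if j == i:
                    diag = v
                else:
                    acc -= v * x[j]
            x[i] = acc / diag
    return x


# Illustrative 4x4 lower-triangular system (made-up values).
L = [
    [(0, 2.0)],
    [(1, 1.0)],
    [(0, 1.0), (2, 4.0)],
    [(1, 2.0), (2, 1.0), (3, 2.0)],
]
levels = build_levels(L)
x = solve_lower(L, [2.0, 3.0, 5.0, 9.0])
print(levels, x)
```

In this sketch every level boundary is a synchronization point, and levels with very few rows leave most threads idle. These are the overheads the abstract says ALB is designed to reduce.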

Citations: 10

A Scalable Matrix-Free Iterative Eigensolver for Studying Many-Body Localization

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368497

R. Beeumen, Gregory D. Kahanamoku-Meyer, N. Yao, Chao Yang

Abstract: We present a scalable and matrix-free eigensolver for studying two-level quantum spin chain models with nearest-neighbor XX + YY interactions plus Z terms. In particular, we focus on the Heisenberg interaction plus random on-site fields, a model that is commonly used to study the many-body localization (MBL) transition. This type of problem is computationally challenging because the vector space dimension grows exponentially with the physical system size, and the solve must be iterated many times to average over different configurations of the random disorder. For each eigenvalue problem, eigenvalues from different regions of the spectrum and their corresponding eigenvectors need to be computed. Traditionally, the interior eigenstates for a single eigenvalue problem are computed via the shift-and-invert Lanczos algorithm. Due to the extremely high memory footprint of the LU factorizations, this technique is not well suited for a large number of spins L, e.g., one needs thousands of compute nodes on modern high performance computing infrastructures to go beyond L = 24. The new matrix-free approach proposed in this paper does not suffer from this memory bottleneck and even allows for simulating spin chains up to L = 24 spins on a single compute node. We discuss the OpenMP and hybrid MPI-OpenMP implementations of matrix-free block matrix-vector operations that are the key components of the new approach. The efficiency and effectiveness of the proposed algorithm are demonstrated by computing eigenstates in a massively parallel fashion, and analyzing their entanglement entropy to gain insight into the MBL transition.
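To make "matrix-free" concrete, the following serial sketch applies a spin-1/2 Heisenberg chain Hamiltonian with on-site fields to a vector without ever storing the matrix. It is not the paper's implementation: the open boundary condition, the convention that a set bit means spin up, and the names `heisenberg_matvec`, `fields`, and `coupling` are assumptions made for this example, and the blocked, OpenMP, and MPI aspects are omitted.

```python
def heisenberg_matvec(v, fields, coupling=1.0):
    """Return H @ v for H = coupling * sum_i S_i.S_{i+1} + sum_i h_i Sz_i.

    Basis states are integers whose bit i encodes spin i (1 = up, 0 = down).
    The chain is open (no bond between the last and first spin), which is an
    assumption of this sketch. Only vectors of length 2**L are stored.
    """
    n_spins = len(fields)
    dim = 1 << n_spins
    out = [0.0] * dim
    for state in range(dim):
        amp = v[state]
        if amp == 0.0:
            continue
        diag = 0.0
        for i in range(n_spins):
            sz_i = 0.5 if (state >> i) & 1 else -0.5
            diag += fields[i] * sz_i              # on-site Z term
            if i + 1 < n_spins:
                sz_j = 0.5 if (state >> (i + 1)) & 1 else -0.5
                diag += coupling * sz_i * sz_j    # ZZ part of the bond
                if sz_i != sz_j:
                    # XX + YY exchanges an anti-aligned neighboring pair.
                    flipped = state ^ (0b11 << i)
                    out[flipped] += 0.5 * coupling * amp
        out[state] += diag * amp
    return out


# Two spins, no field: the singlet (|01> - |10>) has eigenvalue -3/4.
singlet = [0.0, 1.0, -1.0, 0.0]
result = heisenberg_matvec(singlet, [0.0, 0.0])
print(result)
```

Because each output entry is computed from bit operations on the state index, memory use is a few vectors of length 2^L rather than a stored matrix or its LU factors, which is the bottleneck the abstract attributes to shift-and-invert Lanczos.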

Citations: 7

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474

Citations: 0

Wavelength-routing interconnect "Optical Hub" for parallel computing systems

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368495

Y. Urino, K. Mizutani, Tatsuya Usuki, S. Nakamura

Abstract: To solve the inter-node bandwidth bottleneck in parallel computing systems, we propose a wavelength-routing inter-node interconnect, "Optical Hub". The physical topology of Optical Hub is a star network, which leads to advantages in terms of throughput, size, energy consumption and life-time cost. The logical topology is a full-mesh network, which leads to advantages in terms of latency and reliability. We introduced multi-path routing, which expands the effective bandwidth of full-mesh topologies such as Optical Hub, by replacing conventional MPI functions with our wrapper functions. We simulated the execution time of parallel benchmarks on a parallel computing system with Optical Hub using the parallel computing simulator SimGrid. As a result, we have confirmed that a parallel computing system with Optical Hub can achieve higher performance and lower energy consumption than conventional ones. We also examined the scalability of Optical Hub and showed that recursive hierarchical configurations of Optical Hub can drastically reduce the cable count compared with Dragonfly networks when the number of nodes is large.
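The distinction between a physical star and a logical full mesh comes down to simple counting, sketched below. The function names are hypothetical, and the assumption that each node needs exactly one cable to the hub is a simplification for illustration, not a detail taken from the paper.

```python
def star_cable_count(nodes):
    """Cables in a physical star: one per node, all ending at the hub
    (simplifying assumption for this sketch)."""
    return nodes


def full_mesh_pair_count(nodes):
    """Node pairs in a full mesh: every node is directly connected to
    every other node, giving n * (n - 1) / 2 pairs."""
    return nodes * (nodes - 1) // 2


for n in (8, 64, 512):
    print(n, star_cable_count(n), full_mesh_pair_count(n))
```

A physically wired full mesh would need one cable per pair, which grows quadratically with the node count, while the star grows linearly. Wavelength routing is what lets the star provide the direct pairwise paths of the mesh.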

Citations: 2

Towards GPU Acceleration of Phonon Computation with ShengBTE

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368487

Yiming Wei, Xin You, Hailong Yang, Zhongzhi Luan, D. Qian

Citations: 2

FFTE on SVE: SPIRAL-Generated Kernels

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368488

D. Takahashi, F. Franchetti

Citations: 3

Integrating Cache Oblivious Approach with Modern Processor Architecture: The Case of Floyd-Warshall Algorithm

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368477

Toshio Endo

Citations: 1

Scalable Direct-Iterative Hybrid Solver for Sparse Matrices on Multi-Core and Vector Architectures

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368484

K. Ono, Toshihiro Kato, S. Ohshima, T. Nanri

Citations: 2

On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368476

Christian Helm, K. Taura

Abstract: Diagnosing whether an application suffers from DRAM contention can be a challenging task. One method is to compare the hardware memory bandwidth limit with the measured memory bandwidth of an application. Another method is based on memory access latency. The latency of a DRAM access in an uncontended state is a hardware characteristic. If an application shows higher DRAM access latency, the increase comes from queuing delays and the application is limited by DRAM bandwidth. Hardware-based measurement of the application's latency and bandwidth can be done with low overhead and is agnostic of the application's implementation. But the practical implementation of such a diagnosis system on CPUs is difficult. In modern CPUs, there is an abundance of performance counters and only superficial documentation. Different types of counters for bandwidth or latency that seemingly measure the same thing produce different results. There is no in-depth understanding of those performance counters, and naive usage may lead to incorrect measurements. Because there is no hardware feature to measure DRAM access latency directly, the implementation of the above-mentioned latency-based method may seem impossible. In this paper, we compare various hardware latency and bandwidth measurement methods on CPUs by using micro-benchmarks. We show results for Intel Haswell, Broadwell and Skylake systems. With our experiments, we show how and why performance counters for bandwidth and latency differ. Only the counters inside the memory controller correctly measure bandwidth. Latency measured by instruction sampling is suitable for finding DRAM contention, even though it is not a pure DRAM access latency.
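The two diagnosis methods named in the abstract, comparing measured bandwidth against the hardware limit and comparing measured latency against the uncontended latency, can be sketched as a small decision helper. Everything specific here is an assumption for illustration: the function name, the threshold parameters `latency_slack` and `bandwidth_fraction`, and the sample numbers are made up, and the sketch does not read any performance counters.

```python
def diagnose_dram_contention(latency_ns, idle_latency_ns, bandwidth_gbs,
                             peak_bandwidth_gbs, latency_slack=1.2,
                             bandwidth_fraction=0.8):
    """Apply both checks to already-measured values.

    latency_slack and bandwidth_fraction are hypothetical thresholds:
    latency more than latency_slack times the uncontended value, or
    bandwidth above bandwidth_fraction of the hardware limit, is flagged.
    """
    latency_ratio = latency_ns / idle_latency_ns
    utilization = bandwidth_gbs / peak_bandwidth_gbs
    return {
        "latency_ratio": latency_ratio,
        "bandwidth_utilization": utilization,
        "contended_by_latency": latency_ratio > latency_slack,
        "contended_by_bandwidth": utilization > bandwidth_fraction,
    }


# Made-up measurements: latency has doubled, bandwidth sits at 40% of peak.
report = diagnose_dram_contention(180.0, 90.0, 40.0, 100.0)
print(report)
```

The sample input shows a case where the two checks disagree, which is the kind of situation where it matters which counter each input value was read from.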

Citations: 2

Quantum Dynamics at Scale: Ultrafast Control of Emergent Functional Materials

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Pub Date : 2020-01-15 DOI: 10.1145/3368474.3368489

S. Tiwari, A. Krishnamoorthy, P. Rajak, Putt Sakdhnagool, Manaschai Kunaseth, F. Shimojo, S. Fukushima, A. Nakano, Ye Luo, R. Kalia, K. Nomura, P. Vashishta

Abstract: The confluence of extreme-scale quantum dynamics simulations (i.e., quantum@scale) and cutting-edge x-ray free-electron laser experiments is revolutionizing materials science. An archetypal example is the exciting concept of using picosecond light pulses to control emergent material properties on demand in atomically-thin layered materials. This paper describes efforts to scale our quantum molecular dynamics engine toward the United States' first exaflop/s computer, under an Aurora Early Science Program project named "Metascalable layered material genome". Key algorithmic and computing techniques incorporated are: (1) globally-scalable and locally-fast solvers within a linear-scaling divide-conquer-recombine algorithmic framework; (2) algebraic 'BLASification' of computational kernels; and (3) data alignment and loop restructuring, along with register and cache blocking, for enhanced vectorization and efficient memory access. The resulting weak-scaling parallel efficiency was 0.93 on 131,072 Intel Xeon Phi cores for a 56.6 million atom (or 169 million valence-electron) system, whereas the various code transformations achieved 5-fold speedup. The optimized simulation engine allowed us to establish, for the first time, a significant effect of substrate on the dynamics of layered material upon electronic excitation.
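For readers unfamiliar with the metric, weak-scaling parallel efficiency is commonly computed as the ratio of a reference run time to the run time at scale, with the work per core held fixed. The sketch below uses that common definition. Whether the paper uses exactly this form is an assumption, and the timings are made-up values, not results from the paper.

```python
def weak_scaling_efficiency(reference_time_s, scaled_time_s):
    """Efficiency under weak scaling: the problem grows with the core count,
    so ideal scaling keeps the run time constant and gives a ratio of 1.0."""
    return reference_time_s / scaled_time_s


# Hypothetical timings: 10.0 s on the reference core count, 12.5 s at scale.
efficiency = weak_scaling_efficiency(10.0, 12.5)
print(efficiency)
```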

Citations: 0