2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers (Page 10)

Virtual-Link: A Scalable Multi-Producer Multi-Consumer Message Queue Architecture for Cross-Core Communication

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-12-09 DOI: 10.1109/IPDPS49936.2021.00027

Qinzhe Wu, J. Beard, Ashen Ekanayake, A. Gerstlauer, L. John

Abstract: Cross-core communication is increasingly a bottleneck as the number of processing elements per system-on-chip increases. Typical hardware solutions to cross-core communication are often inflexible; while software solutions are flexible, they have performance scaling limitations. A key problem, as we will show, is that of shared state in software-based message queue mechanisms. This paper proposes Virtual-Link (VL), a novel light-weight communication mechanism with hardware support to facilitate M:N lock-free data movement. VL reduces the amount of coherent shared state, which is a bottleneck for many approaches, to zero. VL provides further latency benefit by keeping data on the fast path (i.e., within the on-chip interconnect). VL enables directed cache-injection (stashing) between PEs on the coherence bus, reducing the latency for core-to-core communication. VL is particularly effective for fine-grain tasks on streaming data. Evaluation on a full system simulator with 7 benchmarks shows that VL achieves a $2.09\times$ speedup over state-of-the-art software-based communication mechanisms, while reducing memory traffic by 61%.

Citations: 3
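The Virtual-Link abstract names shared state in software message queues as the key scaling problem. The sketch below is not VL itself (which relies on hardware support) but a conventional software M:N queue of the kind it is compared against, written to make that shared state visible: every producer and consumer touches the same `head`, `tail`, and `count` under one lock. The class and function names (`SharedRingQueue`, `run_demo`) are hypothetical and chosen only for illustration.

```python
import threading


class SharedRingQueue:
    """Bounded M:N queue; head, tail and count are shared by all threads."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0   # next slot to read  (shared state)
        self.tail = 0   # next slot to write (shared state)
        self.count = 0  # occupied slots     (shared state)
        self.cond = threading.Condition()

    def push(self, item):
        with self.cond:
            while self.count == len(self.buf):  # queue full: wait for a pop
                self.cond.wait()
            self.buf[self.tail] = item
            self.tail = (self.tail + 1) % len(self.buf)
            self.count += 1
            self.cond.notify_all()

    def pop(self):
        with self.cond:
            while self.count == 0:  # queue empty: wait for a push
                self.cond.wait()
            item = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            self.count -= 1
            self.cond.notify_all()
            return item


def run_demo(producers=3, consumers=2, per_producer=100):
    """Run M producers and N consumers and return everything consumed."""
    q = SharedRingQueue(capacity=8)
    received = []
    received_lock = threading.Lock()
    stop = object()  # sentinel telling one consumer to exit

    def produce(pid):
        for i in range(per_producer):
            q.push((pid, i))

    def consume():
        while True:
            item = q.pop()
            if item is stop:
                return
            with received_lock:
                received.append(item)

    prod = [threading.Thread(target=produce, args=(p,)) for p in range(producers)]
    cons = [threading.Thread(target=consume) for _ in range(consumers)]
    for t in prod + cons:
        t.start()
    for t in prod:
        t.join()
    for _ in range(consumers):  # one sentinel per consumer, after all data
        q.push(stop)
    for t in cons:
        t.join()
    return received
```

Every `push` and `pop` serializes on the same condition variable, so adding producers or consumers increases contention on those three fields rather than throughput; this is the kind of coherent shared state the abstract says VL reduces to zero.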

FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-11-07 DOI: 10.1109/IPDPS49936.2021.00034

Md. Khaledur Rahman, Majedul Haque Sujon, A. Azad

Citations: 29

Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-11-01 DOI: 10.1109/IPDPS49936.2021.00017

Qinglei Cao, Yu Pei, Kadir Akbudak, G. Bosilca, H. Ltaief, D. Keyes, J. Dongarra

Abstract: The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations (where the main idea consists of exploiting data sparsity, typically by compressing off-diagonal tiles up to an application-specific accuracy threshold) have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be made at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorization by up to 7-fold on a large-scale distributed-memory system, while reducing the memory footprint by up to 44-fold. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries.

Citations: 8

Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-27 DOI: 10.1109/IPDPS49936.2021.00114

Jens Domke, Emil Vatai, Aleksandr Drozd, Peng Chen, Yosuke Oyama, Lingqi Zhang, Shweta Salaria, Daichi Mukunoki, Artur Podobas, M. Wahib, S. Matsuoka

Citations: 18

High-Performance Spectral Element Methods on Field-Programmable Gate Arrays: Implementation, Evaluation, and Future Projection

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-26 DOI: 10.1109/IPDPS49936.2021.00116

Martin Karp, Artur Podobas, Niclas Jansson, Tobias Kenter, Christian Plessl, P. Schlatter, S. Markidis

Abstract: Improvements in computer systems have historically relied on two well-known observations: Moore’s law and Dennard’s scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the general-purpose architectures’ comforts in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a convenient balance between complexity and performance. In this paper, we study modern FPGAs’ applicability in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator operating in double-precision that we empirically evaluate on the latest Stratix 10 GX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project future FPGAs’ performance and role to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?

Citations: 9

Efficient parallel CP decomposition with pairwise perturbation and multi-sweep dimension tree

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-22 DOI: 10.1109/IPDPS49936.2021.00049

Linjian Ma, Edgar Solomonik

Abstract: The widely used alternating least squares (ALS) algorithm for the canonical polyadic (CP) tensor decomposition is dominated in cost by the matricized-tensor times Khatri-Rao product (MTTKRP) kernel. This kernel is necessary to set up the quadratic optimization subproblems. State-of-the-art parallel ALS implementations use dimension trees to avoid redundant computations across MTTKRPs within each ALS sweep. In this paper, we propose two new parallel algorithms to accelerate CP-ALS. We introduce the multi-sweep dimension tree (MSDT) algorithm, which requires the contraction between an order N input tensor and the first-contracted input matrix once every $(N-1)/N$ sweeps. This algorithm reduces the leading order computational cost by a factor of $2(N-1)/N$ relative to the best previously known approach. In addition, we introduce a more communication-efficient approach to parallelizing an approximate CP-ALS algorithm, pairwise perturbation. This technique uses perturbative corrections to the subproblems rather than recomputing the contractions, and asymptotically accelerates ALS. Our benchmark results on 1024 processors on the Stampede2 supercomputer show that CP decomposition obtains a $1.25\times$ speedup from MSDT and a $1.94\times$ speedup from pairwise perturbation compared to the state-of-the-art dimension-tree based CP-ALS implementations.

Citations: 8
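The CP decomposition abstract identifies MTTKRP as the kernel that dominates ALS cost. As a rough illustration of what that kernel computes, here is a naive, unoptimized sketch for an order-3 tensor stored as nested lists; it is not the paper's dimension-tree or MSDT algorithm, and the function name `mttkrp_mode0` is hypothetical.

```python
def mttkrp_mode0(X, B, C):
    """Naive mode-0 MTTKRP for an order-3 tensor.

    X is an I x J x K nested list, B is J x R, C is K x R.
    Returns M (I x R) with M[i][r] = sum over j,k of X[i][j][k] * B[j][r] * C[k][r].
    """
    I, J, K = len(X), len(B), len(C)
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                x = X[i][j][k]
                for r in range(R):
                    M[i][r] += x * B[j][r] * C[k][r]
    return M


# Usage: a rank-1 tensor X = a (outer) b (outer) c, contracted with b and c.
a, b, c = [1.0, 2.0], [1.0, 2.0, 3.0], [1.0, 1.0]
X = [[[ai * bj * ck for ck in c] for bj in b] for ai in a]
M = mttkrp_mode0(X, [[bj] for bj in b], [[ck] for ck in c])
```

The four nested loops perform I·J·K·R multiply-adds for a single mode, and an ALS sweep needs one such product per mode. That redundancy across modes is what dimension trees, as described in the abstract, are used to avoid.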

Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-20 DOI: 10.1109/IPDPS49936.2021.00058

George Bisbas, F. Luporini, M. Louboutin, R. Nelson, G. Gorman, P. Kelly

Abstract: Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications’ stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid (“off-the-grid”). Our work is motivated by modelling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x.

Citations: 0
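To make the "off-the-grid" idea in the temporal blocking abstract concrete, the following minimal 1D sketch injects a point source that sits between two grid points by splitting its amplitude with linear interpolation weights. It is an illustration under simplifying assumptions (1D, linear weights, a hypothetical `inject_source` helper), and it does not use or reproduce the Devito API or the paper's reordering scheme.

```python
def inject_source(field, position, spacing, amplitude):
    """Add a point source at a physical position that need not lie on a grid point.

    field is a list of grid values, spacing is the grid step, and position is
    measured from grid point 0. The amplitude is shared between the two
    neighbouring grid points in proportion to their closeness to the source.
    """
    idx = int(position // spacing)     # grid point at or just left of the source
    frac = position / spacing - idx    # fractional offset within the cell, in [0, 1)
    field[idx] += (1.0 - frac) * amplitude
    if frac > 0.0:                     # source is strictly between idx and idx + 1
        field[idx + 1] += frac * amplitude
    return field


# Usage: a source halfway between grid points 2 and 3 on a grid with step 0.5.
wavefield = inject_source([0.0] * 5, position=1.25, spacing=0.5, amplitude=2.0)
```

Because each injection writes to a few scattered grid points at every time step, a time-tiled loop must know which tiles those writes land in; this is the data dependency the abstract says must be inspected before temporal blocking can be applied.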

Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-20 DOI: 10.1109/IPDPS49936.2021.00060

Giulia Guidi, Oguz Selvitopi, Marquita Ellis, L. Oliker, K. Yelick, A. Buluç

Abstract: One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by $1.2$–$1.3\times$ for the human genome and $1.5$–$1.9\times$ for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by $10.5$–$13.3\times$ for the human genome and $18$–$29\times$ for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.

Citations: 9
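The genome assembly abstract describes transitive reduction as a linear-algebra operation over a semiring. A minimal single-node sketch of that idea follows, assuming the overlap graph is a dictionary of successor sets and using a Boolean (OR, AND) "matrix product" to find two-hop paths. It ignores overlap lengths, read orientation, and distribution, performs only one round, and its function names are hypothetical; it is not the diBELLA 2D implementation.

```python
def two_hop(adj):
    """Boolean product adj * adj over the (OR, AND) semiring, on successor sets.

    The result maps each node u to the set of nodes reachable in exactly two steps.
    """
    return {u: {w for v in nbrs for w in adj.get(v, ())} for u, nbrs in adj.items()}


def remove_transitive_edges(adj):
    """Drop every edge u -> w for which some v gives a path u -> v -> w."""
    reach2 = two_hop(adj)
    return {u: nbrs - reach2[u] for u, nbrs in adj.items()}


# Usage: read "a" overlaps "b" and "c", and "b" overlaps "c",
# so the edge a -> c is implied by a -> b -> c and is removed.
overlap_graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
string_graph = remove_transitive_edges(overlap_graph)
```

Expressed this way, the expensive step is a sparse matrix product followed by an element-wise set difference, which is why the approach maps onto distributed sparse matrix libraries.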

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-20 DOI: 10.1109/IPDPS49936.2021.00097

Jiannan Tian, Cody Rivera, S. Di, Jieyang Chen, Xin Liang, Dingwen Tao, F. Cappello

Abstract: Today’s high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory, such that it could be found as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today’s HPC applications are more and more relying on the accelerators such as GPU on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. The detailed contribution is fourfold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging the state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our implemented multithreaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0× and 6.8× on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3× over the multithread encoder on two 28-core Xeon Platinum 8280 CPUs.

Citations: 20
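The Huffman abstract separates codebook construction from encoding. For reference, the textbook sequential version of both steps can be sketched with a priority queue as below. This is the serial baseline, not the paper's parallel GPU construction or its reduction-based encoder, and the names `huffman_codebook` and `encode` are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codebook(data):
    """Build a prefix-free codebook {symbol: bitstring} from symbol frequencies."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique tie-breaker, partial codebook for that subtree).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(data, codebook):
    """Concatenate the codeword of every input symbol."""
    return "".join(codebook[s] for s in data)


# Usage: frequent symbols receive shorter codewords.
book = huffman_codebook("aaaabbc")
bits = encode("aaaabbc", book)
```

Both steps are inherently serial in this form: the merge loop depends on the previous merge, and the variable-length codewords must be concatenated in order. These are the two parts the abstract says the paper parallelizes.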

Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-10-16 DOI: 10.1109/IPDPS49936.2021.00018

Md Taufique Hussain, Oguz Selvitopi, A. Buluç, A. Azad

Citations: 9