Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures
Daniel Mlakar, Martin Winter, Mathias Parger, M. Steinberger
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021
DOI: 10.1109/IPDPS49936.2021.00080
Bandwidth reduction of sparse matrices is used to reduce fill-in in linear solvers and to increase the performance of other sparse matrix operations, e.g., sparse matrix-vector multiplication in iterative solvers. To compute a bandwidth-reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically connected to high-performance CPU cores only via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU is a viable option. Nevertheless, reordering matrices, potentially multiple times in between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multi-core CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa. Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU thread or a GPU thread block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to that of the ground-truth single-threaded algorithm. We propose the first RCM implementation that runs on the GPU. It achieves a speed-up of several orders of magnitude over NVIDIA's single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches. Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering in sequences of sparse matrix operations without major performance loss.
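For context, the sketch below illustrates the classical sequential RCM procedure that the abstract uses as its ground truth: per connected component, start from a low-degree node, run a breadth-first traversal that enqueues unvisited neighbours in order of increasing degree, and reverse the resulting order. This is a minimal illustrative reference in Python, not the authors' speculative parallel algorithm; the function name, the adjacency-list input format, and the minimum-degree start heuristic (a common simplification of pseudo-peripheral start-node selection) are assumptions made for illustration.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Sequential RCM on an undirected graph given as an adjacency list
    {node: iterable of neighbours}. Returns the new node ordering
    (a permutation of the nodes). Illustrative sketch only."""
    degree = {v: len(adj[v]) for v in adj}
    visited = set()
    order = []

    # Handle every connected component; start each BFS from a
    # minimum-degree node (a simplification of pseudo-peripheral
    # start-node selection).
    for start in sorted(adj, key=lambda v: degree[v]):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours in order of increasing degree.
            children = sorted(
                (u for u in adj[v] if u not in visited),
                key=lambda u: degree[u],
            )
            for u in children:
                visited.add(u)
                queue.append(u)

    # Reversing the Cuthill-McKee order yields the RCM permutation.
    order.reverse()
    return order


# Toy example: symmetric sparsity pattern of a 5x5 matrix.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}
perm = reverse_cuthill_mckee(adj)  # -> [4, 0, 1, 2, 3]
```

The paper's contribution lies in parallelizing exactly this serial traversal: the speculative batches described above discover and sort child nodes concurrently, and the signaling chain plus re-evaluation step ensure the final permutation matches what this single-threaded reference would produce.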