Tree Partitioning Reduction

ACM Transactions on Mathematical Software (TOMS), Pub Date: 2019-08-08, DOI: 10.1145/3328731

A. P. Diéguez, M. Amor, R. Doallo

Citations: 2

Abstract

Tree Partitioning Reduction

Solving tridiagonal linear-equation systems is a fundamental computing kernel in a wide range of scientific and engineering applications, and its computation can be modeled with parallel algorithms. These parallel solvers are typically designed to compute problems whose data fit in a common shared-memory space where all the cores taking part in the computation have access. However, when the problem size is large, data cannot be entirely stored in the common shared-memory space, and a high number of high-latency communications are performed. One alternative is to partition the problem among different memory spaces. At this point, conventional parallel algorithms do not facilitate the partition of computation in independent tiles, since each reduction depends on equations that may be in different tiles. This article proposes an algorithm based on a tree reduction, called the Tree Partitioning Reduction (TPR) method, which partitions the problem into independent slices that can be partially computed in parallel within different common shared-memory spaces. The TPR method can be implemented for any parallel and distributed programming paradigm. Furthermore, in this work, TPR is efficiently implemented for CUDA GPUs to solve large size problems, providing highly competitive performance results with respect to existing packages, being, on average, 22.03× faster than CUSPARSE.
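To make the dependency problem described in the abstract concrete, the sketch below solves a tridiagonal system with classic cyclic reduction, checked against the sequential Thomas algorithm. This is not the TPR method from the paper, only the kind of conventional tree-style reduction the abstract contrasts it with. All function names (`thomas`, `cyclic_reduction`, `cross_tile_rows`) and the tile size are assumptions made for illustration.

```python
def thomas(a, b, c, d):
    """Sequential baseline. Row i reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
    with a[0] == 0 and c[-1] == 0."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def cyclic_reduction(a, b, c, d):
    """Recursive cyclic reduction: every odd row absorbs its two neighbours,
    which leaves a half-size tridiagonal system in the odd unknowns."""
    n = len(d)
    if n == 1:
        return [d[0] / b[0]]
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]                   # eliminates x[i-1] using row i-1
        na = -a[i - 1] * k1
        nb = b[i] - c[i - 1] * k1
        nc = 0.0
        nd = d[i] - d[i - 1] * k1
        if i + 1 < n:
            k2 = c[i] / b[i + 1]               # eliminates x[i+1] using row i+1
            nb -= a[i + 1] * k2
            nc = -c[i + 1] * k2
            nd -= d[i + 1] * k2
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rd.append(nd)
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)  # solve for the odd unknowns
    for i in range(0, n, 2):                    # recover the even unknowns
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


def cross_tile_rows(n, tile):
    """Odd rows whose first reduction step needs a row stored in another tile."""
    return [i for i in range(1, n, 2)
            if any(0 <= j < n and j // tile != i // tile for j in (i - 1, i + 1))]


def build_system(n):
    """Diagonally dominant test system whose exact solution is 1, 2, ..., n."""
    a = [0.0] + [-1.0] * (n - 1)
    b = [4.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    xt = [float(i + 1) for i in range(n)]
    d = [(a[i] * xt[i - 1] if i > 0 else 0.0) + b[i] * xt[i]
         + (c[i] * xt[i + 1] if i + 1 < n else 0.0) for i in range(n)]
    return a, b, c, d, xt
```

With 8 equations split into tiles of 4, `cross_tile_rows(8, 4)` reports row 3, because its reduction reads row 4 from the neighbouring tile. Deeper reduction levels pair rows that are farther apart, so this kind of cross-tile access grows, which is the obstacle the abstract attributes to conventional parallel algorithms.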

Source journal

ACM Transactions on Mathematical Software (TOMS)

Self-citation rate

0.00%

Number of publications