A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization

Qinglei Cao, Rabab Alomairy, Yu Pei, G. Bosilca, H. Ltaief, D. Keyes, J. Dongarra
{"title":"A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization","authors":"Qinglei Cao, Rabab Alomairy, Yu Pei, G. Bosilca, H. Ltaief, D. Keyes, J. Dongarra","doi":"10.1109/ipdps53621.2022.00047","DOIUrl":null,"url":null,"abstract":"We present a general framework that couples the PaRSEC runtime system and the HiCMA numerical library to solve challenging 3D data-sparse problems. Though formally dense, many matrix operators possess a rank structured property that can be exploited during the most time-consuming computational phase, i.e., the matrix factorization. In particular, this work highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by compressing the dense operator. Using Tile Low-Rank (TLR) approximation, our approach consists in capturing the most significant information in each tile of the matrix using a threshold which satisfies the application's accuracy requirements. Matrix operations are performed on the compressed data layout, reducing memory footprint and algorithmic complexity. Our proposed software solution accommodates a range of traditional data structures of linear algebra, i.e., from dense and data-sparse to sparse, within a single matrix operation. Separation of concerns is at the heart: hardware-agnostic implementation, asynchronous execution with a dynamic runtime system, and high performance numerical kernels, to prepare scientific applications to embrace exascale opportunities. This ambition necessitates extensions to PaRSEC that incorporate information related to data structure and rank distribution into the runtime decision-making. We introduce two runtime optimizations to address the challenges encountered when confronted with a large rank disparity: (1) a trimming procedure performed at runtime to cut away data dependencies from the directed acyclic graph discovered to be no longer required after compression and (2) a rank-aware diamond-shaped data distribution to mitigate the load imbalance overheads, reduce data movement, and conserve memory foot-print. We assess our implementation using 3D unstructured mesh deformation based on Radial Basis Function (RBF) interpolation. We report performance results on two different high-performance supercomputers and compare against existing state-of-the-art implementation. Our implementation shows up to 7-fold on Shaheen II and 9-fold on Fugaku performance superiority in situations where the 3D unstructured mesh deformation application renders a matrix operator with low density. Our software framework solves a formally dense 3D problem with 52M mesh points on 65K cores in about half an hour. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system, by synergistically bridging matrix algebra libraries with scientific applications.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ipdps53621.2022.00047","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

We present a general framework that couples the PaRSEC runtime system and the HiCMA numerical library to solve challenging 3D data-sparse problems. Though formally dense, many matrix operators possess a rank structured property that can be exploited during the most time-consuming computational phase, i.e., the matrix factorization. In particular, this work highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by compressing the dense operator. Using Tile Low-Rank (TLR) approximation, our approach consists in capturing the most significant information in each tile of the matrix using a threshold which satisfies the application's accuracy requirements. Matrix operations are performed on the compressed data layout, reducing memory footprint and algorithmic complexity. Our proposed software solution accommodates a range of traditional data structures of linear algebra, i.e., from dense and data-sparse to sparse, within a single matrix operation. Separation of concerns is at the heart: hardware-agnostic implementation, asynchronous execution with a dynamic runtime system, and high performance numerical kernels, to prepare scientific applications to embrace exascale opportunities. This ambition necessitates extensions to PaRSEC that incorporate information related to data structure and rank distribution into the runtime decision-making. We introduce two runtime optimizations to address the challenges encountered when confronted with a large rank disparity: (1) a trimming procedure performed at runtime to cut away data dependencies from the directed acyclic graph discovered to be no longer required after compression and (2) a rank-aware diamond-shaped data distribution to mitigate the load imbalance overheads, reduce data movement, and conserve memory foot-print. We assess our implementation using 3D unstructured mesh deformation based on Radial Basis Function (RBF) interpolation. We report performance results on two different high-performance supercomputers and compare against existing state-of-the-art implementation. Our implementation shows up to 7-fold on Shaheen II and 9-fold on Fugaku performance superiority in situations where the 3D unstructured mesh deformation application renders a matrix operator with low density. Our software framework solves a formally dense 3D problem with 52M mesh points on 65K cores in about half an hour. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system, by synergistically bridging matrix algebra libraries with scientific applications.
一种利用低秩Cholesky分解中数据稀疏性的框架
我们提出了一个通用框架,结合PaRSEC运行时系统和HiCMA数值库来解决具有挑战性的三维数据稀疏问题。虽然形式上密集,但许多矩阵算子具有秩结构性质,可以在最耗时的计算阶段(即矩阵分解)中利用。特别地,这项工作强调了由基于任务的编程模型支持的软件包如何通过压缩密集操作符来处理异构工作负载。使用Tile Low-Rank (TLR)近似,我们的方法包括使用满足应用程序精度要求的阈值捕获矩阵每个Tile中最重要的信息。矩阵运算在压缩的数据布局上执行,减少了内存占用和算法复杂性。我们提出的软件解决方案可以在单个矩阵操作中容纳线性代数的一系列传统数据结构,即从密集和数据稀疏到稀疏。关注点分离是核心:与硬件无关的实现、动态运行时系统的异步执行和高性能数值内核,使科学应用程序准备好迎接百亿亿次的机会。这个目标需要对PaRSEC进行扩展,将与数据结构和等级分布相关的信息合并到运行时决策中。我们引入了两种运行时优化来解决面对大秩差时遇到的挑战:(1)在运行时执行的修整过程,以从压缩后发现不再需要的有向无环图中删除数据依赖关系;(2)具有秩感知的菱形数据分布,以减轻负载不平衡开销,减少数据移动,并节省内存占用空间。我们使用基于径向基函数(RBF)插值的三维非结构化网格变形来评估我们的实现。我们报告了两台不同的高性能超级计算机的性能结果,并与现有的最先进的实现进行了比较。在3D非结构化网格变形应用程序呈现低密度矩阵算子的情况下,我们的实现显示出高达7倍的Shaheen II和9倍的Fugaku性能优势。我们的软件框架在大约半小时内解决了65K核上52M网格点的正式密集3D问题。这项多学科的工作强调了运行时系统需要超越它们在大规模并行硬件系统上的任务调度的主要责任,通过协同连接矩阵代数库和科学应用程序。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信