Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-20 DOI:10.1109/TPDS.2025.3553092

Jan Laukemann;Ahmed E. Helal;S. Isaac Geronimo Anderson;Fabio Checconi;Yongseok Soh;Jesmin Jahan Tithi;Teresa Ranadive;Brian J. Gravelle;Fabrizio Petrini;Jee Choi

{"title":"Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation","authors":"Jan Laukemann;Ahmed E. Helal;S. Isaac Geronimo Anderson;Fabio Checconi;Yongseok Soh;Jesmin Jahan Tithi;Teresa Ranadive;Brian J. Gravelle;Fabrizio Petrini;Jee Choi","doi":"10.1109/TPDS.2025.3553092","DOIUrl":null,"url":null,"abstract":"High-dimensional sparse data emerge in many critical application domains such as healthcare and cybersecurity. To extract meaningful insights from massive volumes of these multi-dimensional data, scientists employ unsupervised analysis tools based on tensor decomposition (TD) methods. However, real-world sparse tensors exhibit highly irregular shapes and data distributions, which pose significant challenges for making efficient use of modern parallel processors. This study breaks the prevailing assumption that compressing sparse tensors into coarse-grained structures (i.e., tensor slices or blocks) or along a particular dimension/mode (i.e., mode-specific) is more efficient than keeping them in a fine-grained, mode-agnostic form. Our novel sparse tensor representation, Adaptive Linearized Tensor Order (<inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula>), encodes tensors in a compact format that can be easily streamed from memory and is amenable to both caching and parallel execution. In contrast to existing compressed tensor formats, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula> constructs one tensor copy that is agnostic to both the mode orientation and the irregular distribution of nonzero elements. To demonstrate the efficacy of <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula>, we accelerate popular TD methods that compute the Canonical Polyadic Decomposition (CPD) model across different types of sparse tensors. We propose a set of parallel TD algorithms that exploit the inherent data reuse of tensor computations to substantially reduce synchronization overhead, decrease memory footprint, and improve parallel performance. Additionally, we characterize the major execution bottlenecks of TD methods on multiple generations of the latest Intel Xeon Scalable processors, including Sapphire Rapids CPUs, and introduce dynamic adaptation heuristics to automatically select the best algorithm based on the sparse tensor characteristics. Across a diverse set of real-world data sets, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula> outperforms the state-of-the-art approaches, achieving more than an order-of-magnitude speedup over the best mode-agnostic formats. Compared to the best mode-specific formats, which require multiple tensor copies, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula>achieves <inline-formula><tex-math>$5.1\\times$</tex-math></inline-formula> geometric mean speedup at a fraction (25% ) of their storage costs. Moreover, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula> obtains <inline-formula><tex-math>$8.4\\times$</tex-math></inline-formula> geometric mean speedup over the state-of-the-art memoization approach, which reduces computations by using extra memory, while requiring 14% of its memory consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1025-1041"},"PeriodicalIF":5.6000,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10935716/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

High-dimensional sparse data emerge in many critical application domains such as healthcare and cybersecurity. To extract meaningful insights from massive volumes of these multi-dimensional data, scientists employ unsupervised analysis tools based on tensor decomposition (TD) methods. However, real-world sparse tensors exhibit highly irregular shapes and data distributions, which pose significant challenges for making efficient use of modern parallel processors. This study breaks the prevailing assumption that compressing sparse tensors into coarse-grained structures (i.e., tensor slices or blocks) or along a particular dimension/mode (i.e., mode-specific) is more efficient than keeping them in a fine-grained, mode-agnostic form. Our novel sparse tensor representation, Adaptive Linearized Tensor Order (

${\sf ALTO}$

), encodes tensors in a compact format that can be easily streamed from memory and is amenable to both caching and parallel execution. In contrast to existing compressed tensor formats,

${\sf ALTO}$

constructs one tensor copy that is agnostic to both the mode orientation and the irregular distribution of nonzero elements. To demonstrate the efficacy of

${\sf ALTO}$

, we accelerate popular TD methods that compute the Canonical Polyadic Decomposition (CPD) model across different types of sparse tensors. We propose a set of parallel TD algorithms that exploit the inherent data reuse of tensor computations to substantially reduce synchronization overhead, decrease memory footprint, and improve parallel performance. Additionally, we characterize the major execution bottlenecks of TD methods on multiple generations of the latest Intel Xeon Scalable processors, including Sapphire Rapids CPUs, and introduce dynamic adaptation heuristics to automatically select the best algorithm based on the sparse tensor characteristics. Across a diverse set of real-world data sets,

${\sf ALTO}$

outperforms the state-of-the-art approaches, achieving more than an order-of-magnitude speedup over the best mode-agnostic formats. Compared to the best mode-specific formats, which require multiple tensor copies,

${\sf ALTO}$

achieves

$5.1\times$

geometric mean speedup at a fraction (25% ) of their storage costs. Moreover,

${\sf ALTO}$

obtains

$8.4\times$

geometric mean speedup over the state-of-the-art memoization approach, which reduces computations by using extra memory, while requiring 14% of its memory consumption.

查看原文本刊更多论文

自适应线性化表示加速稀疏张量分解

高维稀疏数据出现在许多关键应用领域，如医疗保健和网络安全。为了从这些大量的多维数据中提取有意义的见解，科学家们采用了基于张量分解（TD）方法的无监督分析工具。然而，现实世界的稀疏张量表现出高度不规则的形状和数据分布，这对有效使用现代并行处理器构成了重大挑战。这项研究打破了将稀疏张量压缩成粗粒度结构（即张量切片或块）或沿着特定维度/模式（即特定于模式）比将它们保持在细粒度，模式不可知的形式更有效的普遍假设。我们新颖的稀疏张量表示，自适应线性化张量顺序（${\sf ALTO}$），以紧凑的格式编码张量，可以很容易地从内存中流式传输，并且适合缓存和并行执行。与现有的压缩张量格式相比，${\sf ALTO}$构建了一个张量副本，该副本与模式方向和非零元素的不规则分布无关。为了证明${\sf ALTO}$的有效性，我们加速了流行的TD方法，该方法在不同类型的稀疏张量上计算Canonical Polyadic分解（CPD）模型。我们提出了一套并行TD算法，利用张量计算固有的数据重用来大幅降低同步开销，减少内存占用，并提高并行性能。此外，我们描述了TD方法在多代最新英特尔至强可扩展处理器（包括Sapphire Rapids cpu）上的主要执行瓶颈，并引入动态自适应启发式算法，根据稀疏张量特征自动选择最佳算法。在不同的现实世界数据集中，${\sf ALTO}$优于最先进的方法，比最佳的模式无关格式实现了超过数量级的加速。与需要多个张量副本的最佳模式特定格式相比，${\sf ALTO}$以一小部分（25%）的存储成本实现了$5.1\倍的几何平均加速。此外，与最先进的记忆方法相比，${\sf ALTO}$获得了$8.4\倍的几何平均加速，这种方法通过使用额外的内存来减少计算，同时需要14%的内存消耗。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Parallel and Distributed Systems 工程技术-工程：电子与电气

CiteScore

11.00

自引率

9.40%

发文量

281

审稿时长

5.6 months

期刊介绍： IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to: a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing. b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems. c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation. d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.