S-HOT: Scalable High-Order Tucker Decomposition

Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. Pub Date: 2017-02-02. DOI: 10.1145/3018661.3018721

Jinoh Oh, Kijung Shin, E. Papalexakis, C. Faloutsos, Hwanjo Yu


Citations: 45

Abstract

Multi-aspect data appear frequently in many web-related applications. For example, product reviews are quadruplets of (user, product, keyword, timestamp). How can we analyze such web-scale multi-aspect data? Can we analyze them on an off-the-shelf workstation with a limited amount of memory? Tucker decomposition has been widely used for discovering patterns in the relationships among entities in multi-aspect data, which are naturally expressed as high-order tensors. However, existing algorithms for Tucker decomposition have limited scalability; in particular, they fail to decompose high-order tensors (order ≥ 4) because they explicitly materialize intermediate data whose size grows rapidly as the order increases. We call this problem M-Bottleneck ("Materialization Bottleneck"). To avoid M-Bottleneck, we propose S-HOT, a scalable high-order Tucker decomposition method that employs on-the-fly computation to minimize the materialized intermediate data. Moreover, S-HOT is designed to handle disk-resident tensors that are too large to fit in memory, without loading them into memory all at once. We provide a theoretical analysis of the amount of memory space and the number of data scans required by S-HOT. In our experiments, S-HOT showed better scalability than baseline methods, not only with the order but also with the dimensionality and the rank. In particular, S-HOT decomposed tensors with 1000× larger dimensionality than baseline methods could handle. S-HOT also successfully analyzed real-world tensors that are both large-scale and high-order on an off-the-shelf workstation with a limited amount of memory, while baseline methods failed. The source code of S-HOT is publicly available at http://dm.postech.ac.kr/shot to encourage reproducibility.
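
To make the materialization bottleneck concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a sparse tensor stored as coordinate/value pairs and one factor matrix per mode, all with the same rank J. The dense intermediate for mode n then has I_n × J^(N-1) entries, while a matrix-vector product with it can be accumulated one nonzero at a time without ever storing it. All function names are hypothetical, and the column ordering of the Kronecker product is a convention chosen here for illustration.

```python
import numpy as np

def materialized_size(dim_n, rank, order):
    """Number of entries in the dense intermediate for one mode: I_n * J^(N-1)."""
    return dim_n * rank ** (order - 1)

def kron_rows(factors, index, skip):
    """Kronecker product of the factor rows selected by `index`, skipping mode `skip`."""
    row = np.ones(1)
    for m, A in enumerate(factors):
        if m != skip:
            row = np.kron(row, A[index[m]])
    return row

def matvec_on_the_fly(coords, values, factors, mode, v):
    """Compute Y @ v by streaming over the nonzeros; Y itself is never stored."""
    w = np.zeros(factors[mode].shape[0])
    for index, x in zip(coords, values):
        # Each nonzero contributes x * (Kronecker row . v) to one output entry.
        w[index[mode]] += x * (kron_rows(factors, index, mode) @ v)
    return w

def matvec_materialized(coords, values, factors, mode, v):
    """Reference version that builds the full I_n x J^(N-1) matrix first."""
    Y = np.zeros((factors[mode].shape[0], v.shape[0]))
    for index, x in zip(coords, values):
        Y[index[mode]] += x * kron_rows(factors, index, mode)
    return Y @ v
```

Under these assumptions, a 5-order tensor with mode size 10^6 and rank 10 would need 10^10 entries for a single dense intermediate, whereas the streaming version only keeps vectors of length I_n and J^(N-1) in memory.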
