S-HOT: Scalable High-Order Tucker Decomposition

Jinoh Oh, Kijung Shin, E. Papalexakis, C. Faloutsos, Hwanjo Yu
Published in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017-02-02. DOI: 10.1145/3018661.3018721 (https://doi.org/10.1145/3018661.3018721)
Citations: 45

Abstract

Multi-aspect data appear frequently in many web-related applications. For example, product reviews are quadruplets of (user, product, keyword, timestamp). How can we analyze such web-scale multi-aspect data? Can we analyze them on an off-the-shelf workstation with a limited amount of memory? Tucker decomposition has been widely used for discovering patterns in relationships among entities in multi-aspect data, which are naturally expressed as high-order tensors. However, existing algorithms for Tucker decomposition have limited scalability and, in particular, fail to decompose high-order tensors (order ≥ 4) because they explicitly materialize intermediate data whose size grows rapidly as the order increases. We call this problem the M-Bottleneck ("Materialization Bottleneck"). To avoid the M-Bottleneck, we propose S-HOT, a scalable high-order Tucker decomposition method that employs on-the-fly computation to minimize the materialized intermediate data. Moreover, S-HOT is designed to handle disk-resident tensors that are too large to fit in memory, without loading them all into memory at once. We provide a theoretical analysis of the amount of memory space and the number of data scans required by S-HOT. In our experiments, S-HOT scaled better than baseline methods not only with the order but also with the dimensionality and the rank. In particular, S-HOT decomposed tensors 1000× larger than baseline methods in terms of dimensionality. S-HOT also successfully analyzed real-world tensors that are both large-scale and high-order on an off-the-shelf workstation with a limited amount of memory, while baseline methods failed. The source code of S-HOT is publicly available at http://dm.postech.ac.kr/shot to encourage reproducibility.
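
To make the M-Bottleneck concrete, the following is a minimal sketch and is not taken from the S-HOT source code. It assumes the standard alternating formulation of Tucker decomposition, in which updating the factor of one mode involves the input tensor multiplied by the transposed factor matrices of all other modes; stored densely, that intermediate has I_n × ∏_{k≠n} J_k entries. The sketch counts those entries and then shows one possible way to apply the intermediate to a vector by streaming over the nonzeros, so that it is never stored. All function names are hypothetical, and using a matrix-vector product as the on-the-fly building block is an assumption, since the abstract does not specify the exact computation.

```python
import math
import numpy as np

def intermediate_entries(dims, ranks, mode):
    """Entries of the dense intermediate for `mode`: I_mode * prod of the other ranks."""
    return dims[mode] * math.prod(r for k, r in enumerate(ranks) if k != mode)

def kron_row(factors, index, mode):
    """Kronecker product of the factor rows selected by one nonzero, skipping `mode`."""
    row = np.ones(1)
    for k, A in enumerate(factors):
        if k != mode:
            row = np.kron(row, A[index[k]])
    return row

def matvec_on_the_fly(coords, vals, factors, mode, w):
    """Compute y = Y @ w by streaming over nonzeros; the intermediate Y is never stored.

    Extra memory is one temporary row of length prod(other ranks) plus the output.
    """
    y = np.zeros(factors[mode].shape[0])
    for index, x in zip(coords, vals):
        y[index[mode]] += x * kron_row(factors, index, mode).dot(w)
    return y

def matvec_materialized(coords, vals, factors, mode, w):
    """Reference version that builds the full intermediate Y first (the bottleneck)."""
    Y = np.zeros((factors[mode].shape[0], w.shape[0]))
    for index, x in zip(coords, vals):
        Y[index[mode]] += x * kron_row(factors, index, mode)
    return Y @ w
```

With a dimensionality of 10^6 per mode and rank 10 per mode, `intermediate_entries` gives 10^9 entries at order 4 and 10^10 at order 5, which illustrates why materialization becomes infeasible as the order grows. The streaming version trades that storage for recomputing one short Kronecker row per nonzero on each pass over the data.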