Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Proceedings of the 29th International Conference on Scientific and Statistical Database Management. Pub Date: 2017-06-27. DOI: 10.1145/3085504.3085512

Chengjie Qin, Florin Rusu


Citations: 10

Abstract

Big Model analytics tackles the training of massive models that exceed the available memory of a single computing device, e.g., a CPU or GPU. It generalizes Big Data analytics, which targets the training of memory-resident models over out-of-memory training data. In this paper, we propose an in-database solution for Big Model analytics. We identify the dot-product as the primary operation for training generalized linear models and introduce the first array-relation dot-product join database operator between a set of sparse arrays and a dense relation. This is a constrained formulation of the extensively studied sparse matrix-vector multiplication (SpMV) kernel. The paramount challenge in designing the dot-product join operator is how to optimally schedule access to the dense relation based on the non-contiguous entries in the sparse arrays. We propose a practical solution characterized by two technical contributions: dynamic batch processing and array reordering. We devise three heuristics for array reordering (LSH, Radix, and K-center) and analyze them thoroughly. We execute extensive experiments over synthetic and real data that confirm the minimal overhead the operator incurs when sufficient memory is available and the graceful degradation it suffers as memory becomes scarce. Moreover, dot-product join achieves an order of magnitude reduction in execution time over alternative solutions.
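The access-scheduling problem described in the abstract can be illustrated with a small sketch. This is not the authors' operator: it assumes a hypothetical paged layout for the dense relation (PAGE_SIZE entries per page), a hypothetical LRU page cache standing in for limited memory, and a simple sort by touched pages as a stand-in for array reordering. It does not implement the LSH, Radix, or K-center heuristics from the paper. All names (PagedModel, dot_product_join, reorder_by_pages) are invented for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4  # hypothetical: number of model entries stored per page


class PagedModel:
    """Dense relation (index -> value) read through a small LRU page cache.

    Assumes cache_pages >= 1. page_loads counts cache misses, i.e. how
    often a page of the dense relation has to be fetched.
    """

    def __init__(self, values, cache_pages):
        self.values = values
        self.cache_pages = cache_pages
        self.cache = OrderedDict()  # page id -> list of values on that page
        self.page_loads = 0

    def get(self, index):
        page_id = index // PAGE_SIZE
        if page_id in self.cache:
            self.cache.move_to_end(page_id)  # mark as most recently used
        else:
            self.page_loads += 1
            start = page_id * PAGE_SIZE
            self.cache[page_id] = self.values[start:start + PAGE_SIZE]
            if len(self.cache) > self.cache_pages:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[page_id][index % PAGE_SIZE]


def dot_product_join(sparse_arrays, model):
    """Return one dot-product per sparse array.

    Each sparse array is a list of (index, value) pairs; the dense
    relation is read through model.get(index).
    """
    return [sum(v * model.get(i) for i, v in arr) for arr in sparse_arrays]


def reorder_by_pages(sparse_arrays):
    """Illustrative reordering only: sort arrays by the pages they touch.

    Returns a permutation of array positions so that arrays touching the
    same pages end up next to each other.
    """
    def pages(k):
        return sorted({i // PAGE_SIZE for i, _ in sparse_arrays[k]})

    return sorted(range(len(sparse_arrays)), key=pages)


if __name__ == "__main__":
    values = [float(i) for i in range(16)]  # dense relation with 4 pages
    arrays = [
        [(0, 1.0), (13, 2.0)],
        [(5, 1.0), (9, 1.0)],
        [(1, 1.0), (12, 1.0)],
        [(4, 2.0), (8, 1.0)],
    ]

    original = PagedModel(values, cache_pages=2)
    dot_product_join(arrays, original)
    print("page loads, original order:", original.page_loads)

    order = reorder_by_pages(arrays)
    reordered = PagedModel(values, cache_pages=2)
    dot_product_join([arrays[k] for k in order], reordered)
    print("page loads, reordered:", reordered.page_loads)
```

In this toy setup the cache holds two of the four pages, and consecutive arrays in the original order touch disjoint pages, so every access is a miss. Grouping arrays that touch the same pages lets the second array of each group reuse the pages already cached, which is the effect array reordering aims for when memory is scarce.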


