Clustering cubes with binary dimensions in one pass

International Workshop on Data Warehousing and OLAP. Pub Date: 2013-10-28. DOI: 10.1145/2513190.2513192

Carlos Garcia-Alvarado, C. Ordonez

Citations: 2

Abstract

Finding aggregations of high-dimensional records in large data warehouses is a crucial and costly task. These groups of similar records are the partitions obtained with GROUP BY queries. In this research, we focus on obtaining aggregations of groups of similar records by recasting the problem as efficient binary clustering of a fact table, which acts as a relaxation of a GROUP BY clause. We present an efficient window-based variant of the Incremental K-Means algorithm, implemented as a user-defined function in a relational database system. The speedup is achieved through the computation of sufficient statistics, multithreading, efficient distance computation, and sparse matrix operations. Finally, the performance of our algorithm is compared against multiple variants of the K-Means algorithm. Our experiments show that our incremental K-Means algorithm achieves similar or even better results more quickly than the traditional K-Means algorithm.
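The abstract does not give the algorithm itself, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes each binary record is stored sparsely as the set of dimension indices equal to 1, keeps per-cluster sufficient statistics (a count `n` and a linear sum `l`), and refreshes centroids once every `window` records. The function name `incremental_kmeans_binary`, the `seeds` and `window` parameters, and the choice to count each seed once in the statistics are all hypothetical. For a binary record x and centroid C, the squared distance reduces to ||C||^2 plus the sum of (1 - 2*C_i) over the dimensions set in x, so only the non-zero dimensions of x are touched.

```python
from typing import List, Sequence, Set, Tuple


def incremental_kmeans_binary(
    records: Sequence[Set[int]],
    d: int,
    seeds: Sequence[Set[int]],
    window: int = 2,
) -> Tuple[List[int], List[List[float]]]:
    """One pass over sparse binary records (hypothetical sketch).

    records: each record is the set of dimension indices whose value is 1.
    d:       total number of binary dimensions.
    seeds:   initial centroids, also given as index sets (assumption:
             each seed is counted once in the sufficient statistics).
    window:  centroids are recomputed after every `window` records.
    """
    k = len(seeds)
    # Sufficient statistics: n[j] = record count, l[j] = per-dimension sum.
    # For binary data the sum of squares equals l[j], so it is not stored.
    n = [1] * k
    l = [[1.0 if i in s else 0.0 for i in range(d)] for s in seeds]
    centroids = [row[:] for row in l]
    norms = [sum(c * c for c in cen) for cen in centroids]

    def refresh() -> None:
        # Recompute centroids and their squared norms from the statistics.
        for j in range(k):
            centroids[j] = [v / n[j] for v in l[j]]
            norms[j] = sum(c * c for c in centroids[j])

    assignments: List[int] = []
    for t, x in enumerate(records, start=1):
        # Sparse squared distance: ||C||^2 + sum over set bits of (1 - 2*C_i).
        best = min(
            range(k),
            key=lambda j: norms[j] + sum(1.0 - 2.0 * centroids[j][i] for i in x),
        )
        assignments.append(best)
        n[best] += 1
        for i in x:
            l[best][i] += 1.0
        if t % window == 0:
            refresh()
    refresh()  # make the returned centroids reflect every record
    return assignments, centroids
```

A call such as `incremental_kmeans_binary([{0, 1}, {0}, {2, 3}, {3}, {0, 1}], d=4, seeds=[{0, 1}, {2, 3}])` reads each record exactly once. Deferring the centroid refresh to window boundaries avoids recomputing the O(d) norm after every record, which is one plausible reading of "window-based" here.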


