Real-Time Clustering for Large Sparse Online Visitor Data

Proceedings of The Web Conference 2020 | Pub Date: 2020-04-20 | DOI: 10.1145/3366423.3380183

G. Chan, F. Du, Ryan A. Rossi, Anup B. Rao, Eunyee Koh, Cláudio T. Silva, J. Freire


Citations: 4

Abstract

Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to respond in real time to user parameter changes, which current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment is only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incurs a quadratic computation overhead and is infeasible for any moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a more accurate approximation than a straightforward MinHash and LSH implementation on real datasets, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory footprint. Finally, we present an interface to explore customer segments from millions of online visitor records in real time.
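To make the MinHash and LSH step concrete, here is a minimal Python sketch of the plain baseline technique, i.e. the kind of "straightforward MinHash and LSH implementation" the abstract compares against. It is not the authors' improved estimator or their Spark implementation. It assumes each visitor is represented as a non-empty set of non-negative integer column indices (the nonzero entries of that row). The function names, the hash family, and the banding parameters are all illustrative assumptions.

```python
import random
from collections import defaultdict
from itertools import combinations

# Assumption: column indices are non-negative integers smaller than PRIME.
PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(c) = (a*c + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(columns, params):
    """MinHash signature of one visitor: the minimum hash per hash function."""
    return tuple(min((a * c + b) % PRIME for c in columns) for a, b in params)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_candidate_pairs(signatures, bands):
    """Return visitor pairs whose signatures agree on at least one band.

    signatures: dict mapping visitor id -> signature tuple.
    Only these candidate pairs need a distance estimate, which avoids
    comparing all pairs of visitors.
    """
    num_hashes = len(next(iter(signatures.values())))
    rows = num_hashes // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for visitor_id, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(visitor_id)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates
```

With 64 hash functions split into 16 bands of 4 rows, two visitors with identical behavior sets always land in the same bucket, while visitors with disjoint sets never share a signature value under this hash family. Real data falls between these extremes, and the choice of bands and rows trades recall against the number of candidate pairs.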

