Distributed Latent Dirichlet Allocation on Streams

ACM Transactions on Knowledge Discovery from Data (TKDD). Pub Date: 2021-07-03. DOI: 10.1145/3451528

Yunyan Guo, Jianzhong Li

Citations: 2

Abstract

Latent Dirichlet Allocation (LDA) is widely used for topic modeling, with applications in areas such as natural language processing and information retrieval. While LDA on small, static datasets has been studied extensively, practical scenarios pose additional challenges because datasets are often huge and arrive as streams. The state-of-the-art LDA algorithm on streams, Streaming Variational Bayes (SVB), introduced Bayesian updating to provide a streaming procedure. However, the utility of SVB is limited in applications because it ignores three challenges of processing real-world streams: topic evolution, data turbulence, and real-time inference. In this article, we propose a novel distributed LDA algorithm, referred to as StreamFed-LDA, to address these challenges. First, the ability to capture evolving topics is essential for practical online inference on streaming data; to this end, StreamFed-LDA is built on a specialized framework that supports lifelong (continual) learning of evolving topics. Second, data turbulence is common in streams because of real-life events; the design of StreamFed-LDA lets the model learn new characteristics from the most recent data while retaining historical information. Third, providing real-time inference results on massive streaming data is both crucial and difficult; to increase throughput and reduce latency, StreamFed-LDA introduces additional techniques that substantially reduce both computation and communication costs in distributed systems. Experiments on four real-world datasets show that the proposed framework achieves significantly better online inference performance than the baselines, while also reducing latency by orders of magnitude.
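To make the idea of Bayesian updating on a stream concrete, the following is a minimal sketch and not the StreamFed-LDA or SVB implementation. It assumes a single topic whose word counts are observed directly, so the Dirichlet posterior over word probabilities has a closed form and the posterior after one mini-batch serves as the prior for the next. The `decay` parameter and the function names are hypothetical additions for illustration: with `decay < 1`, older evidence is shrunk toward the base prior so that recent batches weigh more, which is one simple way to follow a shifting stream while still keeping some history.

```python
from typing import List, Sequence


def streaming_dirichlet_update(
    lam: Sequence[float],
    batch_counts: Sequence[int],
    prior: Sequence[float],
    decay: float = 1.0,
) -> List[float]:
    """Apply one streaming update to Dirichlet pseudo-counts.

    decay == 1.0: plain Bayesian updating (old posterior becomes new prior).
    decay <  1.0: hypothetical forgetting step, an assumption of this sketch,
                  which shrinks accumulated evidence toward the base prior
                  before the new batch counts are added.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    return [p + decay * (l - p) + c for l, c, p in zip(lam, batch_counts, prior)]


def posterior_mean(lam: Sequence[float]) -> List[float]:
    """Expected word probabilities under a Dirichlet with parameters lam."""
    total = sum(lam)
    return [x / total for x in lam]


# Toy stream over a 3-word vocabulary; the last batch shifts toward word 2.
prior = [1.0, 1.0, 1.0]
batches = [[8, 1, 1], [7, 2, 1], [1, 1, 8]]

plain = list(prior)    # keeps all history with equal weight
decayed = list(prior)  # discounts older batches by half at each step
for counts in batches:
    plain = streaming_dirichlet_update(plain, counts, prior, decay=1.0)
    decayed = streaming_dirichlet_update(decayed, counts, prior, decay=0.5)

print("plain:  ", posterior_mean(plain))
print("decayed:", posterior_mean(decayed))
```

In this toy stream the decayed estimate assigns more probability to word 2 than the plain estimate does, because the earlier batches are discounted. A full LDA model would replace the observed counts with expected counts from variational inference over latent topic assignments.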



