Recurrent Coupled Topic Modeling over Sequential Documents

ACM Transactions on Knowledge Discovery from Data (TKDD), Pub Date: 2021-06-23, DOI: 10.1145/3451530

Jinjin Guo, Longbing Cao, Zhiguo Gong


Citation count: 1

Abstract

Sequential documents such as online archives, social media, and news feeds are updated as a stream, and each chunk of documents carries topics that evolve smoothly yet depend on one another. Such texts have attracted extensive research on dynamic topic modeling, which infers hidden evolving topics and their temporal dependencies. However, most existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. These approaches also face an intractable inference problem when inferring latent parameters, which results in high computational cost and degraded performance. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming a multi-topic-thread evolution. Our method models the dependencies between evolving topics and encodes their complex multi-couplings across time steps. To overcome the intractable inference challenge, we propose a new solution with a set of novel data augmentation techniques that decomposes the multi-couplings between evolving topics. This yields a fully conjugate model, which supports effective and efficient inference. A novel Gibbs sampler with a backward–forward filter algorithm efficiently learns the latent time-evolving parameters in closed form. In addition, a latent Indian Buffet Process compound distribution is exploited to automatically infer the overall number of topics and to customize sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against competitive baselines and outperforms them in terms of lower per-word perplexity, more coherent topics, and better document time prediction.
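The multi-topic-thread assumption in the abstract (a current topic evolving from all prior topics with coupling weights) can be illustrated with a small sketch. This is not the paper's generative specification: the Dirichlet form, the function name `evolve_topics`, and the `concentration` and `smoothing` parameters are all assumptions made here for illustration only.

```python
import numpy as np


def evolve_topics(prev_topics, coupling, concentration=50.0, smoothing=0.01, rng=None):
    """Draw current topics whose prior mean mixes ALL prior topics.

    Illustrative assumption, not the paper's model.
    prev_topics: (K, V) array, each row a word distribution at time t-1.
    coupling:    (K, K) non-negative array; coupling[k, j] is the weight of
                 prior topic j on current topic k (each row must have a
                 positive sum).
    """
    rng = np.random.default_rng() if rng is None else rng
    prev_topics = np.asarray(prev_topics, dtype=float)
    coupling = np.asarray(coupling, dtype=float)

    # Normalise each row so the coupling weights of one current topic sum to 1.
    weights = coupling / coupling.sum(axis=1, keepdims=True)

    # Prior mean of current topic k: a weighted mix of every prior topic.
    mean = weights @ prev_topics  # shape (K, V)

    # Hypothetical Dirichlet prior centred on that mix; the smoothing term
    # keeps every Dirichlet parameter strictly positive.
    alpha = concentration * mean + smoothing
    return np.vstack([rng.dirichlet(a) for a in alpha])


rng = np.random.default_rng(0)
prev = rng.dirichlet(np.ones(6), size=3)  # 3 prior topics over 6 words
coupling = np.array([
    [0.8, 0.1, 0.1],  # topic 0 mostly follows its own thread
    [0.3, 0.4, 0.3],  # topic 1 draws on all three prior topics
    [0.0, 0.5, 0.5],  # topic 2 merges prior topics 1 and 2
])
current = evolve_topics(prev, coupling, rng=rng)
```

Setting `coupling` to the identity matrix recovers the single-topic-thread case that the abstract contrasts against, where each topic depends only on its own predecessor.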

