Dirichlet-Hawkes过程在连续时间文档流聚类中的应用

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI:10.1145/2783258.2783411

Nan Du, Mehrdad Farajtabar, Amr Ahmed, Alex Smola, Le Song

{"title":"Dirichlet-Hawkes过程在连续时间文档流聚类中的应用","authors":"Nan Du, Mehrdad Farajtabar, Amr Ahmed, Alex Smola, Le Song","doi":"10.1145/2783258.2783411","DOIUrl":null,"url":null,"abstract":"Clusters in document streams, such as online news articles, can be induced by their textual contents, as well as by the temporal dynamics of their arriving patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that is not possible to extract using contents only? In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, to take into account both information in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven according to the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian Nonparametrics and temporal Point Processes, which makes the number of clusters grow to accommodate the increasing complexity of online streaming contents, while at the same time adapts to the ever changing dynamics of the respective continuous arrival time. We conducted large-scale experiments on both synthetic and real world news articles, and show that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and arrival time of future documents.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"172","resultStr":"{\"title\":\"Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams\",\"authors\":\"Nan Du, Mehrdad Farajtabar, Amr Ahmed, Alex Smola, Le Song\",\"doi\":\"10.1145/2783258.2783411\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Clusters in document streams, such as online news articles, can be induced by their textual contents, as well as by the temporal dynamics of their arriving patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that is not possible to extract using contents only? In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, to take into account both information in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven according to the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian Nonparametrics and temporal Point Processes, which makes the number of clusters grow to accommodate the increasing complexity of online streaming contents, while at the same time adapts to the ever changing dynamics of the respective continuous arrival time. We conducted large-scale experiments on both synthetic and real world news articles, and show that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and arrival time of future documents.\",\"PeriodicalId\":243428,\"journal\":{\"name\":\"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\",\"volume\":\"98 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-08-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"172\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2783258.2783411\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2783258.2783411","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 172

摘要

文档流中的集群，例如在线新闻文章，可以由它们的文本内容以及它们到达模式的时间动态来诱导。我们是否可以利用这两个信息源来获得更好的文档聚类，并提取仅使用内容无法提取的信息?在本文中，我们提出了一种新的随机过程，称为Dirichlet-Hawkes过程，以在一个统一的框架中考虑这两种信息。该模型的一个显著特征是，Dirichlet过程中存在的根据簇大小对项目的优先依恋，现在是根据簇智能自激时间点过程(Hawkes过程)的强度来驱动的。这个新模型在贝叶斯非参数和时间点过程之间建立了一个以前未被探索的联系，这使得簇的数量增加以适应在线流媒体内容日益增加的复杂性，同时适应各自连续到达时间的不断变化的动态。我们对合成和真实世界的新闻文章进行了大规模实验，结果表明，Dirichlet-Hawkes过程可以同时恢复有意义的主题和时间动态，从而在内容困惑度和未来文档到达时间方面具有更好的预测性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams

Clusters in document streams, such as online news articles, can be induced by their textual contents, as well as by the temporal dynamics of their arriving patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that is not possible to extract using contents only? In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, to take into account both information in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven according to the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian Nonparametrics and temporal Point Processes, which makes the number of clusters grow to accommodate the increasing complexity of online streaming contents, while at the same time adapts to the ever changing dynamics of the respective continuous arrival time. We conducted large-scale experiments on both synthetic and real world news articles, and show that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and arrival time of future documents.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

自引率

0.00%

发文量