Partitioning and Segment Organization Strategies for Real-Time Selective Search on Document Streams

Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. Pub Date: 2017-02-02. DOI: 10.1145/3018661.3018727

Yulu Wang, Jimmy J. Lin


Citations: 1

Abstract

The basic idea behind selective search is to partition a collection into topical clusters and, for each query, consider only the subset of clusters that are likely to contain relevant documents. Previous work on web collections has shown that it is possible to retain high-quality results while considering only a small fraction of the collection. These studies, however, assume static collections where it is feasible to run batch clustering algorithms for partitioning. In this work, we consider the novel formulation of selective search on document streams (specifically, tweets), where partitioning must be performed incrementally. In our approach, documents are partitioned into temporal segments and selective search is performed within each segment: these segments can be clustered using either batch or online algorithms, and at different temporal granularities. For efficiency, we take advantage of word embeddings to reduce the dimensionality of the document vectors. Experiments with test collections from the TREC Microblog Tracks show that we are able to achieve precision indistinguishable from exhaustive search while considering only around 5% of the collection. Interestingly, we observe no significant effectiveness differences between batch vs. online clustering or between hourly vs. daily temporal segments, even though these are very different index organizations. This suggests that architectural choices should be primarily guided by efficiency considerations.
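The general idea in the abstract (temporal segments, incremental clustering within each segment, and low-dimensional document vectors built from word embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-d embedding table `EMB`, the `SegmentedIndex` class, the similarity threshold of 0.8, the hourly segment keys, and the simple "nearest centroid or new cluster" rule are all assumptions made purely for illustration.

```python
import math
from collections import defaultdict

# Toy 2-d "word embeddings"; hypothetical values, not taken from the paper.
EMB = {
    "goal": (1.0, 0.0), "match": (0.9, 0.1), "team": (0.8, 0.2),
    "vote": (0.0, 1.0), "election": (0.1, 0.9), "poll": (0.2, 0.8),
}

def embed(text):
    """Average the embeddings of known words into one short document vector."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

class Cluster:
    def __init__(self, vec, doc):
        self.centroid, self.docs = vec, [doc]

    def add(self, vec, doc):
        n = len(self.docs)
        # Running mean: the centroid is updated without re-clustering.
        self.centroid = tuple((c * n + v) / (n + 1)
                              for c, v in zip(self.centroid, vec))
        self.docs.append(doc)

class SegmentedIndex:
    """Hypothetical index: one list of topical clusters per temporal segment."""

    def __init__(self, new_cluster_threshold=0.8):  # threshold is an assumption
        self.threshold = new_cluster_threshold
        self.segments = defaultdict(list)  # segment key -> list of Cluster

    def add(self, segment_key, text):
        """Online step: join the nearest cluster in the segment or open a new one."""
        vec = embed(text)
        clusters = self.segments[segment_key]
        best = max(clusters, key=lambda c: cosine(c.centroid, vec), default=None)
        if best is not None and cosine(best.centroid, vec) >= self.threshold:
            best.add(vec, text)
        else:
            clusters.append(Cluster(vec, text))

    def search(self, query, clusters_per_segment=1):
        """Selective step: visit only the best-matching clusters of each segment."""
        qvec = embed(query)
        hits = []
        for key in sorted(self.segments):
            ranked = sorted(self.segments[key],
                            key=lambda c: cosine(c.centroid, qvec), reverse=True)
            for cluster in ranked[:clusters_per_segment]:
                hits.extend((key, doc) for doc in cluster.docs)
        return hits

index = SegmentedIndex()
index.add("2013-02-01T10", "team wins match")         # hourly keys are illustrative
index.add("2013-02-01T10", "election poll results")
index.add("2013-02-01T11", "late goal decides match")
index.add("2013-02-01T11", "vote count continues")
results = index.search("goal")
print(results)
```

In this toy run each hourly segment ends up with two clusters, and the query "goal" visits only the sports cluster in each segment, so half of the indexed documents are never scored. Swapping the hourly keys for daily ones, or replacing the online rule with a batch algorithm run once per closed segment, changes only how `segments` is populated; the per-segment selection in `search` stays the same.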

