Estimating the size of hidden data sources by queries

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014) Pub Date: 2014-08-17 DOI: 10.1109/ASONAM.2014.6921664

Yan Wang, Jie Liang, Jianguo Lu

Citations: 2

Abstract

The sizes of hidden data sources are of great interest to the public, researchers, and even business competitors. Estimating the size of a hidden data source has been a challenging problem. Most existing methods are derived from the classic capture-recapture methods. Another approach is based on a large query pool. This approach is not accurate because the document frequencies of the queries in the pool vary widely. To address this problem, we propose a new method that constructs the query pool from a sample of the target data source, so that the document frequency variance is reduced while most of the documents are still covered. Our method is tested on a variety of large textual corpora and outperforms both the baseline random query method and Broder et al.'s estimation method on all the datasets.
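The two families of estimators mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not give their formulas: the first function is the textbook Lincoln-Petersen capture-recapture estimate, and the second is a generic pool-based estimate in which each matched document is weighted by the inverse of the number of pool queries it matches. The names `lincoln_petersen`, `pool_estimate`, `TOY_CORPUS`, and `TOY_POOL` are hypothetical, and the corpus is a toy in-memory dictionary rather than a real hidden source reached through a search interface.

```python
def lincoln_petersen(sample_a, sample_b):
    """Textbook capture-recapture estimate: N ~ |A| * |B| / |A intersect B|."""
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(a) * len(b) / overlap


def pool_estimate(corpus, pool, sampled_queries):
    """Generic pool-based estimate of how many documents match at least one
    pool query.

    corpus: dict mapping doc id -> set of terms (assumed toy stand-in for a
            hidden source; a real source is only reachable through queries).
    pool: iterable of single-term queries.
    sampled_queries: the queries actually issued, each drawn from the pool.
    """
    pool = set(pool)
    total = 0.0
    for query in sampled_queries:
        if query not in pool:
            raise ValueError(f"query {query!r} is not in the pool")
        for terms in corpus.values():
            if query in terms:
                # A document matched by k pool queries contributes 1/k per
                # match, so it counts exactly once across the whole pool.
                total += 1.0 / len(terms & pool)
    return len(pool) * total / len(sampled_queries)


# Hypothetical toy data: document 3 is not covered by any pool query.
TOY_CORPUS = {0: {"a", "b"}, 1: {"b"}, 2: {"a", "b", "c"}, 3: {"d"}}
TOY_POOL = ["a", "b", "c"]
```

Issuing every query in `TOY_POOL` recovers the 3 covered documents, whereas issuing a single query gives about 2.5 for `"a"`, 5.5 for `"b"`, and 1.0 for `"c"`. The spread comes from the differing document frequencies of the queries, which is the variance problem the abstract refers to.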

