Estimating the size of hidden data sources by queries

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014). Pub Date: 2014-08-17. DOI: 10.1109/ASONAM.2014.6921664

Yan Wang, Jie Liang, Jianguo Lu


Citations: 2

Abstract



The sizes of hidden data sources are of great interest to the public, researchers, and even business competitors, yet estimating them remains a challenging problem. Most existing methods are derived from the classic capture-recapture methods. Another approach is based on a large query pool, but it is inaccurate because the document frequencies of the queries in the pool vary widely. To address this problem, we propose a new method that constructs the query pool from a sample of the target data source, so that the document frequency variance is reduced while most of the documents are still covered. Our method is tested on a variety of large textual corpora and outperforms both the baseline random query method and the estimation method of Broder et al. on all the datasets.
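The two families of estimators mentioned in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It is a toy version of a two-sample capture-recapture estimate and of a weighted query-pool estimate, assuming single-term queries and a fully visible in-memory corpus. All names here (`lincoln_petersen`, `pool_estimate`, `docs`, `pool`) are hypothetical.

```python
from statistics import mean


def lincoln_petersen(first, second):
    """Capture-recapture: estimate the population size from two samples
    of document ids, using the size of their overlap."""
    a, b = set(first), set(second)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(a) * len(b) / overlap


def pool_estimate(docs, pool, sampled_queries):
    """Query-pool estimate of how many documents match at least one pool query.

    Each document matched by a query contributes 1 / (number of pool queries
    that match it), so summing over the whole pool counts every covered
    document exactly once. Averaging over a sample of queries and scaling
    by the pool size estimates that sum.
    """
    pool_set = set(pool)
    per_query = []
    for query in sampled_queries:
        total = 0.0
        for terms in docs.values():
            if query in terms:
                total += 1.0 / len(terms & pool_set)
        per_query.append(total)
    return len(pool_set) * mean(per_query)


# Toy corpus (an assumption for illustration): document id -> set of terms.
docs = {
    1: {"graph", "social"},
    2: {"graph", "query"},
    3: {"query", "sample", "social"},
    4: {"sample"},
    5: {"rare"},
}
pool = ["graph", "query", "sample"]

full_pool = pool_estimate(docs, pool, pool)        # uses every pool query
one_query_a = pool_estimate(docs, pool, ["graph"])  # a single sampled query
one_query_b = pool_estimate(docs, pool, ["query"])  # a different single query
recapture = lincoln_petersen([1, 2, 3], [2, 3, 4, 5])
```

With the full pool the estimate recovers the four documents that the pool covers (document 5 matches no pool query). The single-query estimates differ from each other (4.5 versus 3.0), which is the kind of per-query variation that motivates choosing pool queries with more uniform document frequencies.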

