Estimating the size of hidden data sources by queries

Yan Wang, Jie Liang, Jianguo Lu
Published in: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)
Publication date: 2014-08-17
DOI: 10.1109/ASONAM.2014.6921664 (https://doi.org/10.1109/ASONAM.2014.6921664)
Citations: 2

Abstract

The sizes of hidden data sources are of great interest to the public, researchers, and even business competitors. Estimating the size of a hidden data source has been a challenging problem. Most existing methods are derived from the classic capture-recapture methods. Another approach is based on a large query pool; it is inaccurate because the document frequencies of the queries in the pool vary widely. To address this problem, we propose a new method that constructs the query pool from a sample of the target data source, which reduces the variance of the document frequencies while still covering most of the documents. Our method is tested on a variety of large textual corpora and outperforms both the baseline random query method and the estimation method of Broder et al. on all the datasets.
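The two families of estimators named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method, and not necessarily the exact estimator of Broder et al. The first function is the textbook Lincoln-Petersen capture-recapture formula. The second is a generic pool-based estimator that weights each returned document by the inverse of the number of pool queries it matches. The toy corpus `DOCS` and the helpers `search` and `pool_estimate` are hypothetical names introduced only for this illustration.

```python
# Toy corpus standing in for a hidden data source (hypothetical data).
DOCS = {
    1: "apple banana",
    2: "banana cherry",
    3: "cherry",
    4: "apple cherry date",
    5: "elder",
}


def search(query):
    """Hypothetical search interface: return the documents containing the term."""
    return {d: text for d, text in DOCS.items() if query in text.split()}


def lincoln_petersen(n1, n2, overlap):
    """Textbook capture-recapture estimate: N is about n1 * n2 / overlap."""
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return n1 * n2 / overlap


def pool_estimate(pool, sampled_queries, search_fn):
    """Estimate how many documents match at least one query in the pool.

    Each document returned for a sampled query is weighted by
    1 / (number of pool queries matching it), so the weights of any covered
    document sum to 1 over the whole pool.
    """
    pool_set = set(pool)
    total = 0.0
    for q in sampled_queries:
        for text in search_fn(q).values():
            matches = len(pool_set & set(text.split()))
            total += 1.0 / max(matches, 1)
    # Scale the average per-query weight up to the full pool.
    return len(pool) * total / len(sampled_queries)


POOL = ["apple", "banana", "cherry"]
full = pool_estimate(POOL, POOL, search)          # every pool query issued
single = pool_estimate(POOL, ["cherry"], search)  # only one query issued
```

When every pool query is issued, the estimate equals the number of documents the pool covers. When only a few queries are sampled, the result depends on which queries happen to be drawn, which is why the spread of document frequencies within the pool affects accuracy.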