Adaptive query-based sampling for distributed IR

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. Pub Date: 2006-08-06. DOI: 10.1145/1148170.1148277

L. Azzopardi, Mark Baillie, F. Crestani


Citations: 12

Abstract


Adaptive query-based sampling for distributed IR

In Distributed Information Retrieval (DIR) systems, the widely accepted solution for resource description acquisition is Query-Based Sampling (QBS) [1]. In the standard approach to QBS, sampling is curtailed once 300-500 unique documents have been retrieved. This threshold was obtained by empirically measuring the estimated resource description against the actual resource, and then considering the corresponding retrieval selection accuracy [1]. However, a fixed threshold may not generalise to collections and environments beyond those on which it was estimated (i.e. a set of resources of uniform size [1]). Cases where the blanket application of such a heuristic would be inappropriate include (1) when the sizes of the resources are highly skewed and (2) when the resources are very heterogeneous. In the former case, if a resource is very large, then undersampling will occur because not enough documents are obtained. Conversely, if a collection is very small, then oversampling will occur, increasing costs beyond necessity. In the latter case, if the resource is varied and highly heterogeneous, then obtaining a sufficiently accurate description would require more documents to be sampled than when resources are homogeneous. Either way, adopting a flat cutoff will not necessarily provide sufficiently good resource descriptions for all resources.
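
The abstract describes the fixed-threshold heuristic only in words. As an illustration, the following is a minimal sketch of a conventional QBS loop with a flat cutoff. It is not the authors' adaptive method, which the abstract does not specify. Only the 300-500 document threshold comes from the abstract. Everything else is an assumption made for the example: the `search` callable, single-term queries drawn at random from terms already seen, four documents per query, and the names `query_based_sample` and `make_toy_search`.

```python
import random
from collections import Counter


def query_based_sample(search, seed_term, max_docs=300, docs_per_query=4,
                       max_queries=1000, rng=None):
    """Sample one resource until a fixed number of unique documents is reached.

    `search(term, k)` is a hypothetical interface returning up to k
    (doc_id, text) pairs for a single-term query against the resource.
    """
    rng = rng or random.Random(0)
    sampled = {}            # doc_id -> text of each unique sampled document
    term_freqs = Counter()  # estimated resource description (term counts)
    tried = set()           # query terms already issued
    query = seed_term

    for _ in range(max_queries):
        tried.add(query)
        for doc_id, text in search(query, docs_per_query):
            if doc_id not in sampled:
                sampled[doc_id] = text
                term_freqs.update(text.lower().split())
                if len(sampled) >= max_docs:
                    break
        # Flat cutoff: stop regardless of resource size or heterogeneity.
        if len(sampled) >= max_docs:
            break
        # Next query: a random, not yet tried term from the learned vocabulary.
        candidates = sorted(t for t in term_freqs if t not in tried)
        if not candidates:
            break  # vocabulary exhausted, e.g. a very small resource
        query = rng.choice(candidates)

    return sampled, term_freqs


def make_toy_search(docs):
    """Build an in-memory stand-in for a remote resource (illustrative only)."""
    def search(term, k):
        hits = [(d, t) for d, t in docs.items() if term in t.lower().split()]
        return hits[:k]
    return search
```

With a cutoff of 300, a resource holding a few hundred documents would be sampled almost completely, while one holding millions would be described by a tiny fraction of its content. This is the skew problem the abstract raises.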
