To search or to crawl?: towards a query optimizer for text-centric tasks

Proceedings of the 2006 ACM SIGMOD international conference on Management of data. Pub Date: 2006-06-27. DOI: 10.1145/1142473.1142504

Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, L. Gravano


Citations: 81

Abstract

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
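To make the crawl-versus-query trade-off concrete, the following is a deliberately simplified sketch of a cost-based plan choice. It is not the cost model developed in the paper: the names (`Costs`, `scan_time`, `query_time`, `choose_plan`) and all parameters (per-document and per-query times, query precision, the maximum recall reachable through queries) are assumptions introduced only for illustration, and the sketch assumes that a scan encounters useful documents uniformly at random.

```python
import math
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-operation times, in seconds."""
    t_retrieve: float  # fetch one document
    t_process: float   # run the task (e.g. extraction) on one document
    t_query: float     # issue one search query


def scan_time(n_docs: int, target_recall: float, c: Costs) -> float:
    # Assumption: useful documents are spread uniformly, so reaching a
    # fraction r of them requires reading about a fraction r of the database.
    docs_needed = math.ceil(target_recall * n_docs)
    return docs_needed * (c.t_retrieve + c.t_process)


def query_time(useful_docs: int, target_recall: float, docs_per_query: int,
               precision: float, max_recall: float, c: Costs) -> float:
    # Assumption: queries can reach at most `max_recall` of the useful
    # documents; beyond that the query-based plan cannot meet the target.
    if target_recall > max_recall:
        return math.inf
    useful_needed = target_recall * useful_docs
    # Only a fraction `precision` of the retrieved documents is useful.
    docs_needed = math.ceil(useful_needed / precision)
    queries = math.ceil(docs_needed / docs_per_query)
    return queries * c.t_query + docs_needed * (c.t_retrieve + c.t_process)


def choose_plan(n_docs: int, useful_docs: int, target_recall: float,
                docs_per_query: int, precision: float, max_recall: float,
                c: Costs) -> str:
    """Return the name of the plan with the lower estimated time."""
    t_scan = scan_time(n_docs, target_recall, c)
    t_query = query_time(useful_docs, target_recall, docs_per_query,
                         precision, max_recall, c)
    return "query" if t_query < t_scan else "scan"


if __name__ == "__main__":
    costs = Costs(t_retrieve=0.01, t_process=0.1, t_query=0.5)
    for recall in (0.5, 0.9):
        plan = choose_plan(100_000, 1_000, recall, 50, 0.5, 0.7, costs)
        print(f"target recall {recall}: {plan}")
```

In this toy setting the query-based plan wins at a moderate recall target because it touches far fewer documents, while the scan becomes the only viable plan once the target exceeds what the queries can reach. This mirrors, in a very coarse way, the abstract's point that the better plan depends on both execution time and output completeness.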

