Query Selection Techniques for Efficient Crawling of Structured Web Sources

22nd International Conference on Data Engineering (ICDE'06). Pub Date: 2006-04-03. DOI: 10.1109/ICDE.2006.124

Ping Wu, Ji-Rong Wen, Huan Liu, Wei-Ying Ma


Citations: 152

Abstract

The high quality, structured data from Web structured sources is invaluable for many applications. Hidden Web databases are not directly crawlable by Web search engines and are only accessible through Web query forms or via Web service interfaces. Recent research efforts have been focusing on understanding these Web query forms. A critical but still largely unresolved question is: how to efficiently acquire the structured information inside Web databases through iteratively issuing meaningful queries? In this paper we focus on the central issue of enabling efficient Web database crawling through query selection, i.e. how to select good queries to rapidly harvest data records from Web databases. We model each structured Web database as a distinct attribute-value graph. Under this theoretical framework, the database crawling problem is transformed into a graph traversal one that follows "relational" links. We show that finding an optimal query selection plan is equivalent to finding a Minimum Weighted Dominating Set of the corresponding database graph, a well-known NP-Complete problem. We propose a suite of query selection techniques aiming at optimizing the query harvest rate. Extensive experimental evaluations over real Web sources and simulations over controlled database servers validate the effectiveness of our techniques and provide insights for future efforts in this
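The graph model in the abstract can be illustrated with a small sketch. This is not the authors' query selection technique; it is a standard greedy heuristic for a weighted dominating set, applied to a toy attribute-value graph. The names `build_av_graph`, `greedy_dominating_set`, and `cost`, and the sample records, are hypothetical and introduced only for illustration. The sketch assumes each distinct (attribute, value) pair is a node, that two nodes are linked when they co-occur in a record, and that every query has a positive cost.

```python
from collections import defaultdict
from itertools import combinations


def build_av_graph(records):
    """Build an attribute-value graph: one node per distinct
    (attribute, value) pair, with an edge between two nodes whenever
    they appear together in the same record."""
    graph = defaultdict(set)
    for record in records:
        nodes = list(record.items())
        for node in nodes:
            graph[node]  # register the node even if it has no neighbours
        for a, b in combinations(nodes, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def greedy_dominating_set(graph, cost=lambda node: 1.0):
    """Greedy heuristic (an approximation, not an optimal solver):
    repeatedly pick the node whose closed neighbourhood covers the most
    not-yet-dominated nodes per unit of query cost (assumed positive)."""
    undominated = set(graph)
    chosen = []
    while undominated:
        best = max(
            sorted(graph),  # sorted only to make tie-breaking deterministic
            key=lambda n: len(({n} | graph[n]) & undominated) / cost(n),
        )
        chosen.append(best)
        undominated -= {best} | graph[best]
    return chosen


# Toy database, assumed for illustration only.
records = [
    {"author": "A", "venue": "ICDE"},
    {"author": "B", "venue": "ICDE"},
    {"author": "B", "venue": "VLDB"},
]
graph = build_av_graph(records)
queries = greedy_dominating_set(graph)
print(queries)
```

Each selected node corresponds to one query value to issue. Because every other node is adjacent to a selected one, every attribute value in the toy database appears in the result of at least one issued query. Passing a different `cost` function (for example, an estimate of result pages per query) weights the selection toward cheaper queries.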


