Query Selection Techniques for Efficient Crawling of Structured Web Sources

Ping Wu, Ji-Rong Wen, Huan Liu, Wei-Ying Ma
{"title":"Query Selection Techniques for Efficient Crawling of Structured Web Sources","authors":"Ping Wu, Ji-Rong Wen, Huan Liu, Wei-Ying Ma","doi":"10.1109/ICDE.2006.124","DOIUrl":null,"url":null,"abstract":"The high quality, structured data from Web structured sources is invaluable for many applications. Hidden Web databases are not directly crawlable by Web search engines and are only accessible through Web query forms or via Web service interfaces. Recent research efforts have been focusing on understanding these Web query forms. A critical but still largely unresolved question is: how to efficiently acquire the structured information inside Web databases through iteratively issuing meaningful queries? In this paper we focus on the central issue of enabling efficient Web database crawling through query selection, i.e. how to select good queries to rapidly harvest data records from Web databases. We model each structured Web database as a distinct attribute-value graph. Under this theoretical framework, the database crawling problem is transformed into a graph traversal one that follows \"relational\" links. We show that finding an optimal query selection plan is equivalent to finding a Minimum Weighted Dominating Set of the corresponding database graph, a well-known NP-Complete problem. We propose a suite of query selection techniques aiming at optimizing the query harvest rate. Extensive experimental evaluations over real Web sources and simulations over controlled database servers validate the effectiveness of our techniques and provide insights for future efforts in this","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"152","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"22nd International Conference on Data Engineering (ICDE'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2006.124","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 152

Abstract

The high quality, structured data from Web structured sources is invaluable for many applications. Hidden Web databases are not directly crawlable by Web search engines and are only accessible through Web query forms or via Web service interfaces. Recent research efforts have been focusing on understanding these Web query forms. A critical but still largely unresolved question is: how to efficiently acquire the structured information inside Web databases through iteratively issuing meaningful queries? In this paper we focus on the central issue of enabling efficient Web database crawling through query selection, i.e. how to select good queries to rapidly harvest data records from Web databases. We model each structured Web database as a distinct attribute-value graph. Under this theoretical framework, the database crawling problem is transformed into a graph traversal one that follows "relational" links. We show that finding an optimal query selection plan is equivalent to finding a Minimum Weighted Dominating Set of the corresponding database graph, a well-known NP-Complete problem. We propose a suite of query selection techniques aiming at optimizing the query harvest rate. Extensive experimental evaluations over real Web sources and simulations over controlled database servers validate the effectiveness of our techniques and provide insights for future efforts in this
结构化Web资源高效抓取的查询选择技术
来自Web结构化源的高质量结构化数据对许多应用程序都是无价的。隐藏的Web数据库不能被Web搜索引擎直接抓取,只能通过Web查询表单或Web服务接口访问。最近的研究工作集中在理解这些Web查询表单上。一个关键但仍未解决的问题是:如何通过迭代地发出有意义的查询来有效地获取Web数据库中的结构化信息?在本文中,我们关注通过查询选择实现高效Web数据库爬行的核心问题,即如何选择好的查询来快速从Web数据库中获取数据记录。我们将每个结构化Web数据库建模为一个不同的属性-值图。在这个理论框架下,数据库爬行问题被转化为遵循“关系”链接的图遍历问题。我们证明了找到最优查询选择计划等同于找到相应数据库图的最小加权支配集,这是一个众所周知的np完全问题。我们提出了一套旨在优化查询收获率的查询选择技术。对真实Web源的广泛实验评估和对受控数据库服务器的模拟验证了我们技术的有效性,并为这方面的未来工作提供了见解
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信