Web-scale information extraction with Vertex

2011 IEEE 27th International Conference on Data Engineering. Pub Date: 2011-04-11. DOI: 10.1109/ICDE.2011.5767842

P. Gulhane, Amit Madaan, Rupesh R. Mehta, J. Ramamirtham, R. Rastogi, Sandeepkumar Satpal, Srinivasan H. Sengamedu, Ashwin Tengli, Charu Tiwari

Citations: 89

Abstract

Vertex is a wrapper induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) grouping similar structured pages in a Web site, (2) picking the appropriate sample pages for wrapper inference, (3) learning XPath-based extraction rules that are robust to variations in site structure, (4) detecting site changes by monitoring sample pages, and (5) optimizing editorial costs by reusing rules, among others. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
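To make items (3) and (4) concrete, the following is a minimal sketch of what an XPath-based extraction rule with fallbacks, plus a simple sample-page change check, might look like. It is not the Vertex implementation: the rule format, the field names, the sample pages, the `extract_record` and `site_changed` helpers, and the 0.5 threshold are all assumptions made for illustration, and it relies only on the limited XPath subset in Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: each field has an ordered list of candidate XPaths.
# Later entries act as fallbacks when the page structure varies.
RULES = {
    "title": [".//h1[@class='title']", ".//div[@id='header']/h1"],
    "price": [".//span[@class='price']"],
}


def extract_record(page_xml, rules):
    """Apply each field's XPaths in order and keep the first non-empty match."""
    root = ET.fromstring(page_xml)
    record = {}
    for field, xpaths in rules.items():
        record[field] = None
        for xpath in xpaths:
            node = root.find(xpath)
            if node is not None and node.text and node.text.strip():
                record[field] = node.text.strip()
                break
    return record


def site_changed(sample_pages, rules, threshold=0.5):
    """Flag a possible site change when more than `threshold` of the
    monitored sample pages yield a record with at least one missing field.
    The threshold is an illustrative assumption, not a value from the paper."""
    if not sample_pages:
        return False
    failures = 0
    for page in sample_pages:
        record = extract_record(page, rules)
        if any(value is None for value in record.values()):
            failures += 1
    return failures / len(sample_pages) > threshold


# Illustrative, well-formed sample pages (made up for this sketch).
PAGE_A = ("<html><body><h1 class='title'>Blue Kettle</h1>"
          "<span class='price'>$19.99</span></body></html>")
PAGE_B = ("<html><body><div id='header'><h1>Red Toaster</h1></div>"
          "<span class='price'>$34.50</span></body></html>")
PAGE_C = ("<html><body><p class='name'>Green Mixer</p>"
          "<p class='cost'>$50.00</p></body></html>")

if __name__ == "__main__":
    print(extract_record(PAGE_A, RULES))
    print(extract_record(PAGE_B, RULES))
    print(site_changed([PAGE_A, PAGE_B], RULES))
    print(site_changed([PAGE_C], RULES))
```

In this sketch PAGE_B lacks the class attribute that the first title rule expects, so the second XPath supplies the value, while PAGE_C matches no rule at all and is the kind of page the change check is meant to flag. Real Web pages are rarely well-formed XML, so a production system would need a tolerant HTML parser and a full XPath engine.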



