Automatic web data extraction using tree alignment

Proceedings of the 18th ACM conference on Information and knowledge management. Pub Date: 2009-11-02. DOI: 10.1145/1645953.1646194

Yingju Xia, Hao Yu, Shu Zhang

Citations: 5

Abstract

This paper investigates the automatic extraction of data from forums, blogs and news web sites. Web pages are increasingly dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment to automatically extract data from these types of web pages. A new tree alignment algorithm is presented for determining the optimal matching structure of the input web pages. Based on the alignment, the trees are merged into one union tree whose nodes record statistical information obtained from multiple web pages. A heuristic method is employed for determining the most probable content block and the alignment algorithm detects repeating patterns on the union tree. A wrapper built on the most probable content block and the repeating patterns extracts data from web pages. Experimental results show that the method achieves high extraction accuracy and has steady performance.
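The abstract does not specify the alignment algorithm itself, so the following is only a rough illustration of the general idea of aligning page trees and merging them into a union tree with per-node statistics. It swaps in Simple Tree Matching, a well-known top-down tree matching technique, as a stand-in for the paper's algorithm. The names `Node`, `simple_tree_matching`, `align_children` and `merge` are hypothetical, and the `count` field is an assumed form of the "statistical information" mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list[Node] = field(default_factory=list)
    count: int = 1  # assumed statistic: number of pages containing this node


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum top-down, order-preserving matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i children of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + w)
    return dp[m][n] + 1  # +1 for the matched roots


def align_children(a: Node, b: Node) -> list[tuple[int, int]]:
    """Index pairs (i, j) of children of a and b that an optimal alignment matches."""
    m, n = len(a.children), len(b.children)
    w = [[simple_tree_matching(x, y) for y in b.children] for x in a.children]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + w[i - 1][j - 1])
    # Backtrack through the table to recover the matched pairs.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if w[i - 1][j - 1] > 0 and dp[i][j] == dp[i - 1][j - 1] + w[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def merge(a: Node, b: Node) -> Node:
    """Merge tree b into union tree a in place (roots assumed to share a tag)."""
    a.count += b.count
    merged: list[Node] = []
    ai = bi = 0
    # A sentinel pair at the end flushes the unmatched tails of both child lists.
    for pa, pb in align_children(a, b) + [(len(a.children), len(b.children))]:
        merged.extend(a.children[ai:pa])  # children seen only in a
        merged.extend(b.children[bi:pb])  # children seen only in b
        if pa < len(a.children):
            merged.append(merge(a.children[pa], b.children[pb]))
        ai, bi = pa + 1, pb + 1
    a.children = merged
    return a
```

In a sketch like this, union-tree nodes whose `count` approaches the number of merged pages would correspond to template structure, while low-count nodes vary between pages. Picking the most probable content block and detecting repeating patterns, as the abstract describes, would need further heuristics that are not shown here.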

