Learning information extraction patterns from tabular Web pages without manual labelling

Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003). Pub Date: 2003-10-13. DOI: 10.1109/WI.2003.1241249

Xiaoying Gao, Mengjie Zhang, Peter M. Andreae

Citations: 7

Abstract

We describe a domain-independent approach to automatically constructing information extraction patterns for semistructured Web pages. The approach was tested on three corpora containing a series of tabular Web sites from different domains and achieved a success rate of at least 80%. A significant strength of the system is that it can infer extraction patterns from a single training page and does not require any manual labeling of the training page.
