Testbed for information extraction from deep web

WWW Alt. '04, Pub Date: 2004-05-19, DOI: 10.1145/1013367.1013468

Yasuhiro Yamada, Nick Craswell, Tetsuya Nakatoh, S. Hirokawa

Citations: 49

Abstract

Search results generated by searchable databases are served dynamically and are far larger in volume than the static documents on the Web. These results pages have been referred to as the Deep Web. To integrate results from different searchable databases, we need to extract the target data from their results pages. We propose a testbed for information extraction from search results. We chose 100 databases at random from 114,540 pages with search forms, so the databases cover a good variety. We selected 51 databases whose results pages include URLs and manually identified the target information to be extracted. We also suggest evaluation measures for comparing extraction methods, and methods for extending the target data.

