Schema inference and data extraction from templatized Web pages

2015 International Conference on Pervasive Computing (ICPC); Pub Date: 2015-04-16; DOI: 10.1109/PERVASIVE.2015.7087084

Shinde Santaji Krishna, J. S. Dattatraya

Citations: 5

Abstract

The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many websites consist mostly of pages that contain structured data, so extracting information from a site's Web documents is an important step in Web information integration. This paper presents an unsupervised approach to page-level data extraction that automatically detects the schema of web pages. Web pages are compared based on visual clues to determine whether they follow a fixed or a variant template. Data regions are then extracted from the pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
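The abstract does not spell out its tree matching or alignment procedures. As a rough illustration of the general idea of comparing the tag trees of two pages, the sketch below uses Simple Tree Matching, a well-known dynamic-programming algorithm for ordered labeled trees; it is not necessarily the algorithm the authors use, and it compares tag structure only, not the visual clues the paper mentions. The `Node` class, the `similarity` normalization, and the sample pages are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    if a.tag != b.tag:
        return 0  # roots with different tags cannot be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalized score in [0, 1]; 1.0 means structurally identical trees."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two made-up pages that share a template but differ in list length.
page_a = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("ul", [Node("li"), Node("li")]),
])])
page_b = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("ul", [Node("li"), Node("li"), Node("li")]),
])])

score = similarity(page_a, page_b)
```

A high score suggests that two pages share a template. Any cut-off used to separate fixed-template pages from variant-template pages would have to be chosen empirically; the abstract gives none.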

