Schema inference and data extraction from templatized Web pages

Shinde Santaji Krishna, J. S. Dattatraya
Published in: 2015 International Conference on Pervasive Computing (ICPC)
Publication date: 2015-04-16
DOI: 10.1109/PERVASIVE.2015.7087084 (https://doi.org/10.1109/PERVASIVE.2015.7087084)
Citations: 5

Abstract

The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many websites consist largely of pages that contain structured data, so extracting information from these pages is an important step in Web information integration. This paper presents an unsupervised approach to page-level data extraction that automatically detects the schema of web pages. Web pages are compared on the basis of visual clues to determine whether they follow a fixed or a variant template. Data regions are then extracted from the pages; for pages that belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
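The abstract does not spell out how pages are compared or how trees are matched, so the following is only a rough sketch of the general idea of deciding whether two pages share a template. It substitutes the classic Simple Tree Matching algorithm on tag structure for the paper's visual clues; the `Node` class, the `similarity` normalization, and the 0.8 threshold are all illustrative assumptions rather than anything taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM-like node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest top-down, order-preserving matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matched nodes divided by the average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def same_template(a: Node, b: Node, threshold: float = 0.8) -> bool:
    """Treat two pages as sharing a template above an assumed threshold."""
    return similarity(a, b) >= threshold


# Two list pages that differ only in the number of records, and one unrelated page.
page1 = Node("html", [Node("body", [Node("h1"), Node("ul", [Node("li"), Node("li")])])])
page2 = Node("html", [Node("body", [Node("h1"), Node("ul", [Node("li"), Node("li"), Node("li")])])])
page3 = Node("html", [Node("body", [Node("table", [Node("tr")])])])

print(same_template(page1, page2))  # pages with the same skeleton
print(same_template(page1, page3))  # pages with different skeletons
```

In this toy setup, pages that differ only in how many records they list score high and would be grouped together, while a structurally different page scores low. A real system along the lines the abstract describes would feed the grouped pages into merging and alignment steps to recover the schema, which this sketch does not attempt.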