Web data extraction using textual anchors

Ahmad Pouramini, Sh. Nasiri
Published in: 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI)
Publication date: 2015-11-01
DOI: 10.1109/KBEI.2015.7436204 (https://doi.org/10.1109/KBEI.2015.7436204)
Citations: 3

Abstract

In this paper, we present an approach and a visual tool, called ABDES, for creating web wrappers that extract data records from web pages. Our approach relies mainly on the visible page content, simulating the way a human user scans a web page for specific data. To create a wrapper, we use text features such as textual delimiters, keywords, constants, or text patterns, which we call anchors, to build patterns for the target data regions and data records. We offer a polynomial-time data extraction algorithm in which these patterns are checked against the page elements in a mixed bottom-up and top-down traversal of the DOM tree. The extracted data is mapped directly onto a hierarchical XML structure as the output of the algorithm. The wrappers generated by the system are robust and independent of the HTML structure. Therefore, they can be adapted to multiple websites to gather and integrate information.
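
The idea of anchoring on visible text rather than on HTML structure can be illustrated with a small sketch. This is not the ABDES algorithm; it is a simplified illustration that assumes records are marked by keyword anchors such as "Title:" and "Price:". All names here (`Node`, `TreeBuilder`, `extract_records`, `to_xml`, the sample page and patterns) are hypothetical. The sketch finds a text node matching the first anchor, climbs bottom-up to the smallest element whose visible text contains every anchor, reads the fields from that region's text with regular expressions, and writes the result as XML.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A DOM node; visible text is stored in '#text' leaf nodes."""

    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []

    def visible_text(self):
        if self.tag == "#text":
            return self.text
        parts = (child.visible_text() for child in self.children)
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a minimal tree of Node objects from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.current.children.append(Node("#text", self.current, text))


def text_leaves(node):
    if node.tag == "#text":
        yield node
    for child in node.children:
        yield from text_leaves(child)


def has_all(node, anchors):
    text = node.visible_text()
    return all(re.search(a, text) for a in anchors)


def extract_records(html, record_anchors, field_patterns):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    records, seen = [], set()
    for leaf in text_leaves(builder.root):
        if not re.search(record_anchors[0], leaf.text):
            continue
        # Bottom-up: climb until the element's text contains every anchor.
        node = leaf.parent
        while node.parent is not None and not has_all(node, record_anchors):
            node = node.parent
        if node.parent is None or id(node) in seen:
            continue  # no enclosing region found, or region already used
        seen.add(id(node))
        # Read each field from the region's visible text.
        text = node.visible_text()
        record = {}
        for name, pattern in field_patterns.items():
            match = re.search(pattern, text)
            if match:
                record[name] = match.group(1)
        records.append(record)
    return records


def to_xml(records):
    root = ET.Element("records")
    for record in records:
        element = ET.SubElement(root, "record")
        for name, value in record.items():
            ET.SubElement(element, name).text = value
    return ET.tostring(root, encoding="unicode")


PAGE = """
<div><h2>Offers</h2>
<ul>
  <li><b>Title:</b> Blue Mug <span>Price: $4.50</span></li>
  <li><b>Title:</b> Red Bowl <span>Price: $7.00</span></li>
</ul></div>
"""
ANCHORS = [r"Title:", r"Price:"]
FIELDS = {"title": r"Title:\s*(.+?)\s*Price:", "price": r"Price:\s*\$([\d.]+)"}

records = extract_records(PAGE, ANCHORS, FIELDS)
print(to_xml(records))
```

Because the patterns refer only to visible text, the same anchors and field patterns would still match if the `<li>` elements were replaced by table rows or nested `<div>` elements. The sketch has clear limits: a record that lacks one of the anchors makes the climb stop at a larger enclosing element, and text inside `<script>` or `<style>` is treated as visible.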