Automatic web data extraction using tree alignment

Yingju Xia, Hao Yu, Shu Zhang
Proceedings of the 18th ACM conference on Information and knowledge management, 2009-11-02
DOI: 10.1145/1645953.1646194 (https://doi.org/10.1145/1645953.1646194)
Citations: 5

Abstract

This paper investigates the automatic extraction of data from forums, blogs and news web sites. Web pages are increasingly dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment to automatically extract data from these types of web pages. A new tree alignment algorithm is presented for determining the optimal matching structure of the input web pages. Based on the alignment, the trees are merged into one union tree whose nodes record statistical information obtained from multiple web pages. A heuristic method is employed for determining the most probable content block and the alignment algorithm detects repeating patterns on the union tree. A wrapper built on the most probable content block and the repeating patterns extracts data from web pages. Experimental results show that the method achieves high extraction accuracy and has steady performance.
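
The abstract does not spell out the alignment algorithm, so the following is only a minimal sketch of the general idea, assuming a Simple Tree Matching style dynamic program as a stand-in for the paper's own method. It aligns two tag trees and merges them into a union tree whose nodes count how many input pages contained them. The names `Node`, `match_score`, and `merge` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    count: int = 1  # number of input pages in which this node was seen


def _child_table(a: Node, b: Node):
    """Pairwise child scores w and the order-preserving alignment table dp."""
    m, n = len(a.children), len(b.children)
    w = [[match_score(ca, cb) for cb in b.children] for ca in a.children]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + w[i - 1][j - 1])
    return w, dp


def match_score(a: Node, b: Node) -> int:
    """Number of node pairs in the best order-preserving alignment."""
    if a.tag != b.tag:
        return 0
    _, dp = _child_table(a, b)
    return dp[-1][-1] + 1  # +1 for the matched roots


def merge(a: Node, b: Node) -> Node:
    """Merge two trees with equal root tags into one union tree.

    Aligned nodes are merged and their counts summed; unaligned
    subtrees from either side are kept as they are.
    """
    w, dp = _child_table(a, b)
    i, j = len(a.children), len(b.children)
    merged: List[Node] = []
    while i > 0 and j > 0:
        if w[i - 1][j - 1] > 0 and dp[i][j] == dp[i - 1][j - 1] + w[i - 1][j - 1]:
            merged.append(merge(a.children[i - 1], b.children[j - 1]))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j]:
            merged.append(a.children[i - 1])
            i -= 1
        else:
            merged.append(b.children[j - 1])
            j -= 1
    merged.extend(a.children[k] for k in range(i - 1, -1, -1))
    merged.extend(b.children[k] for k in range(j - 1, -1, -1))
    merged.reverse()
    return Node(a.tag, merged, a.count + b.count)


# Two toy pages that share a template but differ in detail.
page1 = Node("body", [Node("div", [Node("h1"), Node("p")]),
                      Node("ul", [Node("li"), Node("li")])])
page2 = Node("body", [Node("div", [Node("h1"), Node("p"), Node("p")]),
                      Node("span")])
union = merge(page1, page2)
```

In this sketch, nodes whose `count` equals the number of merged pages are likely template structure, while lower counts point to page-specific content. The heuristic for picking the most probable content block and the detection of repeating patterns described in the abstract are not reproduced here.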