基于概率树编辑模型的鲁棒web提取方法

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data Pub Date : 2009-06-29 DOI:10.1145/1559845.1559882

Nilesh N. Dalvi, P. Bohannon, Fei Sha

{"title":"基于概率树编辑模型的鲁棒web提取方法","authors":"Nilesh N. Dalvi, P. Bohannon, Fei Sha","doi":"10.1145/1559845.1559882","DOIUrl":null,"url":null,"abstract":"On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently--in quadratic time in the size of the trees--making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"165 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"80","resultStr":"{\"title\":\"Robust web extraction: an approach based on a probabilistic tree-edit model\",\"authors\":\"Nilesh N. Dalvi, P. Bohannon, Fei Sha\",\"doi\":\"10.1145/1559845.1559882\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently--in quadratic time in the size of the trees--making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.\",\"PeriodicalId\":344093,\"journal\":{\"name\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"volume\":\"165 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-06-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"80\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1559845.1559882\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1559845.1559882","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 80

摘要

在脚本生成的web站点上，许多文档共享通用的HTML树结构，从而允许包装器有效地提取感兴趣的信息。当然，脚本和树结构会随着时间的推移而变化，导致包装器反复中断，并导致维护包装器的高成本。在本文中，我们探索了一种新颖的方法:我们使用网页的时间快照来开发HTML的树编辑模型，并使用该模型来改进包装器的构造。我们将树结构的变化视为一系列编辑操作的假设:删除节点、插入节点和替换节点的标签。树结构通过随机选择这些编辑操作而进化。我们的模型很有吸引力，因为源树进化成目标树的概率可以有效地估计——在树大小的二次时间内——使其成为各种树进化问题的潜在有用工具。我们给出了一种从由树对组成的训练样本中学习概率模型的算法，并将该算法应用于网页快照集合，以派生出特定于html的树编辑模型。最后，我们描述了一个考虑了树编辑模型的新的包装构造框架，并在合成的和真实的HTML文档示例中将所得到的包装与传统包装的质量进行了比较。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Robust web extraction: an approach based on a probabilistic tree-edit model

On script-generated web sites, many documents share common HTML tree structure, allowing wrappers to effectively extract information of interest. Of course, the scripts and thus the tree structure evolve over time, causing wrappers to break repeatedly, and resulting in a high cost of maintaining wrappers. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view the changes to the tree structure as suppositions of a series of edit operations: deleting nodes, inserting nodes and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently--in quadratic time in the size of the trees--making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

自引率

0.00%

发文量