Web Data Extraction Based on Visual Information and Partial Tree Alignment

2014 11th Web Information System and Application Conference. Pub Date: 2014-09-12. DOI: 10.1109/WISA.2014.12

Siwu Fan, Xinjun Wang, Yongquan Dong


Citations: 5

Abstract

Web databases contain a huge amount of structured data that are readily accessible only through their query interfaces. The query results are presented in dynamically generated web pages, usually in the form of data records, for human use. Automatic web data extraction is critical to web integration. A number of approaches have been proposed. Early work is mostly based on the source code or the tag tree of the page. Recent approaches use visual features to extract data, and they perform better than the earlier work. However, these approaches still have inherent limitations. In this paper, we propose a novel approach that makes use of visual features to extract data from web pages, including both the data records and the data items. Experimental results on a large set of query result pages from different domains show that the proposed approach is highly effective.
