Semi-Automated Extraction of Targeted Data from Web Pages

22nd International Conference on Data Engineering Workshops (ICDEW'06). Pub Date: 2006-04-03. DOI: 10.1109/ICDEW.2006.135

Fabrice Estiévenart, Jean-Roch Meurisse, Jean-Luc Hainaut, Philippe Thiran


Citations: 3

Abstract

The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, while the main publication standard on the Web (HTML) is well suited to human reading, its poor semantics make it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of the recurring types of information in a cluster of similar Web documents, together with their location in those documents. Relying on these rules, HTML-embedded data can be extracted into a more computable format. The definition of mapping rules is based on direct user input, mainly for the interpretation part, and on automatic computation for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
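To make the idea of a mapping rule concrete, here is a minimal sketch in Python. It is not Retrozilla's rule format or algorithm, which the abstract does not describe. The `MappingRule` class, the `extract` function, the sample pages, and the paths are all hypothetical. The sketch assumes that a rule pairs a user-supplied semantic label with a location in the document tree, that the location is written in the limited XPath subset of Python's `xml.etree.ElementTree`, and that the pages are well-formed XHTML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRule:
    """Hypothetical rule: a semantic label plus a location in the document tree."""
    concept: str  # semantic interpretation, supplied by the user
    path: str     # location, written in ElementTree's limited XPath subset


def extract(page: str, rules: list[MappingRule]) -> dict[str, list[str]]:
    """Apply every rule to one page and collect the text found at each location."""
    root = ET.fromstring(page)  # assumes well-formed XHTML, unlike most real HTML
    return {
        rule.concept: [(node.text or "").strip() for node in root.findall(rule.path)]
        for rule in rules
    }


# Two pages from the same hypothetical cluster: same layout, different data.
PAGES = [
    "<html><body><h1>Dune</h1><ul><li>Frank Herbert</li></ul></body></html>",
    "<html><body><h1>Good Omens</h1><ul>"
    "<li>Terry Pratchett</li><li>Neil Gaiman</li></ul></body></html>",
]

# One rule per recurring type of information in the cluster.
RULES = [
    MappingRule(concept="title", path="./body/h1"),
    MappingRule(concept="author", path="./body/ul/li"),
]

records = [extract(page, RULES) for page in PAGES]
for record in records:
    print(record)
```

Because both pages share one layout, the same two rules turn each of them into a labelled record. In the approach the abstract describes, the label would come from the user and the location would be computed automatically, whereas this sketch writes both by hand.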

