A tool-supported method to extract data and schema from Web sites

Fifth IEEE International Workshop on Web Site Evolution, 2003. Theme: Architecture. Proceedings. Pub Date: 2003-09-29. DOI: 10.1109/WSE.2003.1234003

Fabrice Estiévenart, Aurore François, J. Henrard, Jean-Luc Hainaut


Citations: 26

Abstract

This paper presents a tool-supported method to reengineer Web sites, that is, to extract the page contents as XML documents structured by expressive DTDs or XML Schemas. All the pages recognized as expressing the same application (sub)domain are analyzed to derive their common structure. This structure is formalized by an XML document, called META, which is then used to extract an XML document containing the data of the pages and an XML Schema validating these data. The META document can describe various structures, such as alternative layouts and data structures for the same concept, structure multiplicity, and the separation between layout and informational content. The XML Schemas extracted from different page types are integrated and conceptualized into a single schema describing the domain covered by the whole Web site. Finally, this conceptual schema is used to build the database of a renovated Web site. These principles are illustrated through a case study using the tools that create the META document and extract the data and the XML Schema.
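To make the extraction step more concrete, the following is a minimal Python sketch of the general idea: a shared page-structure description drives the extraction of an XML data document, from which occurrence bounds (the kind of information an XML Schema records as minOccurs/maxOccurs) can be observed. It is not the authors' tool or the actual META format, which the abstract does not specify. The `META` dictionary, the `FieldCollector` class, the sample pages, and the element names `data` and `person` are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a META description:
# concept name -> (HTML tag, class attribute) that marks it in a page.
META = {
    "name": ("h1", "staff-name"),
    "phone": ("span", "phone"),
}


class FieldCollector(HTMLParser):
    """Collects the text of elements whose (tag, class) pair matches a rule."""

    def __init__(self, meta):
        super().__init__()
        self.rules = {pair: concept for concept, pair in meta.items()}
        self.current = None   # concept currently being read, if any
        self.values = []      # [concept, text] pairs in document order

    def handle_starttag(self, tag, attrs):
        key = (tag, dict(attrs).get("class"))
        if key in self.rules:
            self.current = self.rules[key]
            self.values.append([self.current, ""])

    def handle_data(self, data):
        if self.current is not None:
            self.values[-1][1] += data

    def handle_endtag(self, tag):
        # Simplification: capture stops at the first closing tag,
        # so markup nested inside a matched element is not handled.
        self.current = None


def extract(pages, meta, record_tag="person"):
    """Build one XML record per page from the fields matched by the rules."""
    root = ET.Element("data")
    for html in pages:
        collector = FieldCollector(meta)
        collector.feed(html)
        record = ET.SubElement(root, record_tag)
        for concept, text in collector.values:
            ET.SubElement(record, concept).text = text.strip()
    return root


def occurrence_bounds(root, meta):
    """Observed (min, max) occurrences of each concept per record."""
    bounds = {}
    for concept in meta:
        counts = [len(record.findall(concept)) for record in root]
        bounds[concept] = (min(counts), max(counts))
    return bounds


# Two invented pages of the same page type; the second has no phone number.
PAGES = [
    '<html><body><h1 class="staff-name">Ann Lee</h1>'
    '<span class="phone">555-0101</span>'
    '<span class="phone">555-0102</span></body></html>',
    '<html><body><h1 class="staff-name">Bob Ray</h1></body></html>',
]

root = extract(PAGES, META)
xml_text = ET.tostring(root, encoding="unicode")
bounds = occurrence_bounds(root, META)
print(xml_text)
print(bounds)
```

In this toy setting, a concept that is absent from some pages and repeated in others ends up with a lower bound of 0 and an upper bound above 1, which loosely mirrors the "structure multiplicity" the abstract mentions. A real tool would also need to handle alternative layouts for the same concept, which this sketch does not attempt.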

