Fabrice Estiévenart, Aurore François, J. Henrard, Jean-Luc Hainaut
A tool-supported method to extract data and schema from Web sites
Fifth IEEE International Workshop on Web Site Evolution, 2003. Theme: Architecture. Proceedings.
Published: 2003-09-29
DOI: 10.1109/WSE.2003.1234003 (https://doi.org/10.1109/WSE.2003.1234003)
Citations: 26
Abstract
This paper presents a tool-supported method to reengineer Web sites, that is, to extract the page contents as XML documents structured by expressive DTDs or XML Schemas. All the pages that are recognized as expressing the same application (sub)domain are analyzed in order to derive their common structure. This structure is formalized by an XML document, called META, which is then used to extract an XML document that contains the data of the pages and an XML Schema that validates these data. The META document can describe various structures, such as alternative layouts and data structures for the same concept, structure multiplicity, and the separation between layout and informational content. The XML Schemas extracted from different page types are integrated and conceptualized into a single schema describing the domain covered by the whole Web site. Finally, this conceptual schema is used to build the database of a renovated Web site. These principles are illustrated through a case study using the tools that create the META document and extract the data and the XML Schema.
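The extraction step described in the abstract can be illustrated with a minimal sketch. It assumes a well-formed XHTML page and uses a Python dictionary as a hypothetical stand-in for the META document. The actual META format is not given in the abstract, so the keys `root`, `record`, `record_path` and `fields`, the sample page, and the `extract` function are all illustrative assumptions, not the authors' tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a META description: it says which layout
# elements carry which concepts. The real META format is an XML document
# whose structure is not specified in the abstract.
META = {
    "root": "staff",            # name of the extracted document's root
    "record": "member",         # one record per repeated layout unit
    "record_path": ".//tr",     # where the repeated units are in the page
    "fields": {"name": "td[1]", "office": "td[2]"},  # concept -> location
}

# Illustrative page of a given type (assumed to be well-formed XHTML).
PAGE = """<html><body><table>
<tr><td>Alice</td><td>B12</td></tr>
<tr><td>Bob</td><td>C07</td></tr>
</table></body></html>"""


def extract(page_source, meta):
    """Build an XML tree holding only the informational content of a page."""
    page = ET.fromstring(page_source)
    root = ET.Element(meta["root"])
    for unit in page.findall(meta["record_path"]):
        record = ET.SubElement(root, meta["record"])
        for field, path in meta["fields"].items():
            cell = unit.find(path)
            if cell is not None:  # a field may be absent from a given unit
                ET.SubElement(record, field).text = (cell.text or "").strip()
    return root


data = extract(PAGE, META)
xml_text = ET.tostring(data, encoding="unicode")
print(xml_text)
```

The point of the sketch is the separation the abstract mentions: the layout (table rows and cells) appears only in the mapping, while the extracted document contains only named concepts. Deriving a validating XML Schema from the same description would be a further step that is not shown here.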