A tool-supported method to extract data and schema from Web sites

Fabrice Estiévenart, Aurore François, J. Henrard, Jean-Luc Hainaut
Published in: Fifth IEEE International Workshop on Web Site Evolution, 2003. Theme: Architecture. Proceedings.
Publication date: 2003-09-29
DOI: 10.1109/WSE.2003.1234003 (https://doi.org/10.1109/WSE.2003.1234003)
Citation count: 26

Abstract

This paper presents a tool-supported method to reengineer Web sites, that is, to extract page contents as XML documents structured by expressive DTDs or XML Schemas. All pages recognized as expressing the same application (sub)domain are analyzed to derive their common structure. This structure is formalized in an XML document, called META, which is then used to extract both an XML document containing the data of the pages and an XML Schema that validates these data. The META document can describe various structures, such as alternative layouts and data structures for the same concept, structure multiplicity, and the separation between layout and informational content. The XML Schemas extracted from different page types are integrated and conceptualized into a single schema describing the domain covered by the whole Web site. Finally, this conceptual schema is used to build the database of a renovated Web site. These principles are illustrated through a case study using the tools that create the META document and extract the data and the XML Schema.
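
The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' tool and does not use the real META format, which the abstract does not specify. The `<meta>` and `<field>` elements, the `tag` attribute, the sample pages, and the `extract` function are all assumptions invented for illustration. The sketch reads a small META-like XML description, pulls the matching values out of each HTML page into an XML data document, and derives per-field occurrence bounds of the kind an XML Schema would record as `minOccurs` and `maxOccurs`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical META-like description (invented format, not the paper's):
# each <field> names a concept and the HTML tag that carries its value.
META = """
<meta concept="person">
  <field name="name" tag="h1"/>
  <field name="phone" tag="li"/>
</meta>
"""

# Two made-up pages of the same page type; the second has no phone numbers.
PAGES = [
    "<html><body><h1>Ada</h1><ul><li>555-0100</li><li>555-0101</li></ul></body></html>",
    "<html><body><h1>Alan</h1></body></html>",
]


def extract(meta_xml, pages):
    """Return (XML data document, {field: (min_occurs, max_occurs)})."""
    meta = ET.fromstring(meta_xml)
    concept = meta.get("concept")
    fields = meta.findall("field")
    root = ET.Element(concept + "s")
    counts = {f.get("name"): [] for f in fields}
    for html in pages:
        record = ET.SubElement(root, concept)
        for field in fields:
            tag = re.escape(field.get("tag"))
            # Naive regex matching; a real tool would parse the HTML properly.
            values = re.findall(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html, re.S)
            counts[field.get("name")].append(len(values))
            for value in values:
                ET.SubElement(record, field.get("name")).text = value.strip()
    # Occurrence bounds observed across all pages (assumes at least one page).
    schema = {name: (min(c), max(c)) for name, c in counts.items()}
    return root, schema


data, schema = extract(META, PAGES)
print(ET.tostring(data, encoding="unicode"))
print(schema)
```

With these sample pages, `name` occurs exactly once per page and `phone` between zero and two times, which is the sort of structure multiplicity the abstract says META can describe. Alternative layouts and the separation of layout from content are not covered by this sketch.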