DWSpyder: a new schema extraction method for a deep web integration system

Yasser Saissi, A. Zellou, Ali Adri
Int. J. Web Eng. Technol. Journal article, published 2019-10-03. DOI: https://doi.org/10.1504/ijwet.2019.102872

Abstract

The deep web is the large part of the web that search engines do not index. Deep web sources are accessible only through their associated access forms. We want to use a web integration system to access deep web sources and all of their information. To implement such a system, we need the schema description of each web source. The problem addressed in this paper is how to extract the schema describing a deep web source whose content cannot be accessed directly. We propose the DWSpyder method, which extracts this schema despite the source's inaccessibility. DWSpyder starts with a static analysis of the source's access forms to extract the first elements of the associated schema description. Its second step is a dynamic analysis of these access forms, which uses queries to enrich the schema description. DWSpyder also uses a clustering algorithm to identify the possible values of form fields whose sets of values are undefined. Finally, DWSpyder uses all of the extracted information to automatically generate the schema description of each deep web source.
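
As an illustration of the static analysis step only, the following is a minimal Python sketch that reads an access form and records each field together with its defined set of values, when the form declares one. It is not the DWSpyder implementation, which the abstract does not detail. The names FormSchemaParser, extract_form_schema, and SAMPLE_FORM, as well as the sample form itself, are hypothetical and introduced here for illustration. Fields such as free-text inputs are marked as having an undefined value set, which is the case the abstract says is handled later by dynamic analysis and clustering.

```python
from html.parser import HTMLParser


class FormSchemaParser(HTMLParser):
    """Hypothetical helper: collects form fields as a rough schema description."""

    # Input types that carry no schema information in this sketch.
    IGNORED_TYPES = {"submit", "button", "reset", "hidden", "image"}

    def __init__(self):
        super().__init__()
        # field name -> {"type": str, "values": list of str, or None if undefined}
        self.fields = {}
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "input" and name:
            kind = (attrs.get("type") or "text").lower()
            if kind in self.IGNORED_TYPES:
                return
            entry = self.fields.setdefault(name, {"type": kind, "values": None})
            # Radio buttons and checkboxes declare their possible values.
            if kind in ("radio", "checkbox") and attrs.get("value"):
                if entry["values"] is None:
                    entry["values"] = []
                entry["values"].append(attrs["value"])
        elif tag == "select" and name:
            self._current_select = name
            self.fields[name] = {"type": "select", "values": []}
        elif tag == "option" and self._current_select:
            # Skip placeholder options that have an empty value.
            if attrs.get("value"):
                self.fields[self._current_select]["values"].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None


def extract_form_schema(html):
    """Return the fields found in the given HTML as a dict keyed by field name."""
    parser = FormSchemaParser()
    parser.feed(html)
    return parser.fields


# Hypothetical access form used only to exercise the sketch.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="title">
  <select name="category">
    <option value="">Any</option>
    <option value="books">Books</option>
    <option value="music">Music</option>
  </select>
  <input type="radio" name="format" value="new">
  <input type="radio" name="format" value="used">
  <input type="submit" value="Search">
</form>
"""

schema = extract_form_schema(SAMPLE_FORM)
for field_name, info in schema.items():
    values = info["values"] if info["values"] is not None else "undefined value set"
    print(field_name, info["type"], values)
```

In this sketch, the "category" and "format" fields come out with explicit value lists, while "title" has none. A field like "title" is the kind for which the possible values would have to be discovered by submitting queries and analysing the results, as the abstract describes for the dynamic step.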