DWSpyder: a new schema extraction method for a deep web integration system

Int. J. Web Eng. Technol. Pub Date: 2019-10-03 DOI: 10.1504/ijwet.2019.102872

Yasser Saissi, A. Zellou, Ali Adri

Citations: 0

Abstract

The deep web is the large part of the web that search engines do not index. Deep web sources are accessible only through their associated access forms. We wish to use a web integration system to access deep web sources and all of their information. To implement such a system, we need the schema description of each web source. The problem addressed in this paper is how to extract the schema describing a deep web source that cannot be accessed directly. We propose the DWSpyder method, which extracts the schema describing a deep web source despite its inaccessibility. DWSpyder starts with a static analysis of the source's access forms to extract the first elements of the associated schema description. The second step is a dynamic analysis of these access forms, using queries to enrich the schema description. DWSpyder also uses a clustering algorithm to identify the possible values of deep web form fields whose sets of values are undefined. DWSpyder uses all of the extracted information to automatically generate deep web source schema descriptions.
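To make the static-analysis step more concrete, the following is a minimal Python sketch of how the fields of an access form might be read into a first schema description. It is an illustration under stated assumptions, not the authors' implementation: the names FormFieldExtractor, extract_form_schema, and SAMPLE_FORM are hypothetical, the sample form is invented, and the dynamic analysis and clustering steps are not covered.

```python
from html.parser import HTMLParser


class FormFieldExtractor(HTMLParser):
    """Hypothetical helper: collects named form fields and, for <select>
    fields, the option values listed in the form."""

    def __init__(self):
        super().__init__()
        # field name -> {"type": str, "values": list of predefined values}
        self.fields = {}
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            # Free-text inputs have no predefined value set.
            self.fields[attrs["name"]] = {
                "type": attrs.get("type", "text"),
                "values": [],
            }
        elif tag == "select" and attrs.get("name"):
            self._current_select = attrs["name"]
            self.fields[self._current_select] = {"type": "select", "values": []}
        elif tag == "option" and self._current_select is not None:
            value = attrs.get("value")
            if value:
                self.fields[self._current_select]["values"].append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None


def extract_form_schema(html):
    """Return a dict describing the named fields found in the given HTML."""
    parser = FormFieldExtractor()
    parser.feed(html)
    return parser.fields


# Invented sample form, used only for illustration.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="title">
  <select name="genre">
    <option value="fiction">Fiction</option>
    <option value="history">History</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

schema = extract_form_schema(SAMPLE_FORM)
for name, info in schema.items():
    print(name, info)
```

In this sketch the "title" field has an empty value list, which corresponds to the kind of field with an undefined set of values that the abstract says is handled by query-based analysis and clustering.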

