Heuristics-Based Schema Extraction for Deep Web Query Interfaces

2017 IEEE International Conference on Information Reuse and Integration (IRI). Pub Date: 2017-08-01. DOI: 10.1109/IRI.2017.80

Chichang Jou, Yucheng Cheng



Abstract


As the internet has grown in popularity, the content stored in web databases has also grown quickly. This data, hidden behind query interfaces, is called the deep web, and search engines normally do not collect it. Many deep web applications, such as content collection, topic-focused crawling, and data integration, depend on understanding the schema of these query interfaces. The schema needs to cover the mappings between input elements and labels, the data types of valid input values, and the range constraints on those values. We propose a Heuristics-based deep web query interface Schema Extraction system (HSE) that identifies labels, elements, mappings between labels and elements, and relationships among elements. In HSE, the text surrounding each element is collected as candidate labels. We propose a string similarity function with a dynamically set similarity threshold to cleanse the candidate labels. The elements, candidate labels, and line breaks in a query interface are then streamlined into its Interface Expression (IEXP). By combining the user's view and the designer's view, with the aid of semantic information, we build heuristic rules to extract the schema from the IEXPs of query interfaces in the ICQ dataset. These rules are built on (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships among labels and elements. The extracted schema not only helps extract deep web content, but also benefits schema matching and schema merging. Experimental results on the TEL-8 dataset show that HSE performs effectively.
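To make the idea of an interface expression more concrete, the following is a minimal sketch, not the authors' implementation. It flattens a small HTML form into a token sequence and pairs each element with nearby text. The token alphabet (`t` for text, `e` for element, `|` for a row break), the names `InterfaceLinearizer`, `linearize`, `iexp`, and `map_labels`, and the single "nearest text in the same row" rule are all assumptions made for illustration; the abstract does not specify HSE's actual IEXP format or rules.

```python
from html.parser import HTMLParser

FORM_ELEMENTS = {"input", "select", "textarea"}
ROW_BREAKS = {"br", "tr", "p", "div"}  # assumed row delimiters


class InterfaceLinearizer(HTMLParser):
    """Flatten a form into ('t', text), ('e', name) and ('|', None) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "option":
            self._in_option = True
        elif tag in FORM_ELEMENTS and attrs.get("type") not in ("hidden", "submit"):
            self.tokens.append(("e", attrs.get("name", "")))
        elif tag in ROW_BREAKS:
            self._row_break()

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._in_option = False
        if tag in ROW_BREAKS:
            self._row_break()

    def handle_data(self, data):
        text = " ".join(data.split())
        # Option text is a value of the element, not a candidate label.
        if text and not self._in_option:
            self.tokens.append(("t", text))

    def _row_break(self):
        # Collapse consecutive row breaks into one.
        if self.tokens and self.tokens[-1][0] != "|":
            self.tokens.append(("|", None))


def linearize(html):
    parser = InterfaceLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def iexp(tokens):
    """Render the token kinds as a compact string."""
    return "".join(kind for kind, _ in tokens)


def map_labels(tokens):
    """Hypothetical rule: nearest preceding text in the row, else nearest following."""
    mapping, row = {}, []
    for tok in tokens + [("|", None)]:
        if tok[0] != "|":
            row.append(tok)
            continue
        for i, (kind, name) in enumerate(row):
            if kind != "e":
                continue
            before = [v for k, v in row[:i] if k == "t"]
            after = [v for k, v in row[i + 1:] if k == "t"]
            mapping[name] = before[-1] if before else (after[0] if after else None)
        row = []
    return mapping


html = """
<form>
  Title: <input type="text" name="title"><br>
  Price from <input type="text" name="min_price"> to <input type="text" name="max_price"><br>
  <input type="submit" value="Search">
</form>
"""
tokens = linearize(html)
print(iexp(tokens))  # te|tete|
print(map_labels(tokens))
```

In this toy form the second row has the pattern `tete`, which is the kind of layout where a range relationship ("from ... to ...") between two elements might be detected. A real system would need many more rules than the single proximity rule used here.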
