提示:大型RDF数据源的混合和增量类型发现

33rd International Conference on Scientific and Statistical Database Management Pub Date : 2021-07-06 DOI:10.1145/3468791.3468808

Nikolaos Kardoulakis, Kenza Kellou-Menouer, Georgia Troullinou, Zoubida Kedad, D. Plexousakis, H. Kondylakis

{"title":"提示:大型RDF数据源的混合和增量类型发现","authors":"Nikolaos Kardoulakis, Kenza Kellou-Menouer, Georgia Troullinou, Zoubida Kedad, D. Plexousakis, H. Kondylakis","doi":"10.1145/3468791.3468808","DOIUrl":null,"url":null,"abstract":"The rapid explosion of linked data has resulted into many weakly structured and incomplete data sources, where typing information might be missing. On the other hand, type information is essential for a number of tasks such as query answering, integration, summarization and partitioning. Existing approaches for type discovery, either completely ignore type declarations available in the dataset (implicit type discovery approaches), or rely only on existing types, in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between the instances. This process is expensive and not incremental. Explicit type enrichment approaches on the other hand, are not able to identify new types and they can not process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, enabling type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the various instances, we index and then group them to identify the types. During the processing of an instance, our approach exploits its type information, if available, to improve the quality of the discovered types by guiding the classification of the new instance in the correct group and by refining the groups already built. We analytically and experimentally show that our approach dominates in terms of efficiency, competitors from both worlds, implicit type discovery and explicit type enrichment while outperforming them in most of the cases in terms of quality.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"106 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources\",\"authors\":\"Nikolaos Kardoulakis, Kenza Kellou-Menouer, Georgia Troullinou, Zoubida Kedad, D. Plexousakis, H. Kondylakis\",\"doi\":\"10.1145/3468791.3468808\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The rapid explosion of linked data has resulted into many weakly structured and incomplete data sources, where typing information might be missing. On the other hand, type information is essential for a number of tasks such as query answering, integration, summarization and partitioning. Existing approaches for type discovery, either completely ignore type declarations available in the dataset (implicit type discovery approaches), or rely only on existing types, in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between the instances. This process is expensive and not incremental. Explicit type enrichment approaches on the other hand, are not able to identify new types and they can not process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, enabling type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the various instances, we index and then group them to identify the types. During the processing of an instance, our approach exploits its type information, if available, to improve the quality of the discovered types by guiding the classification of the new instance in the correct group and by refining the groups already built. We analytically and experimentally show that our approach dominates in terms of efficiency, competitors from both worlds, implicit type discovery and explicit type enrichment while outperforming them in most of the cases in terms of quality.\",\"PeriodicalId\":312773,\"journal\":{\"name\":\"33rd International Conference on Scientific and Statistical Database Management\",\"volume\":\"106 1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-07-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"33rd International Conference on Scientific and Statistical Database Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3468791.3468808\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"33rd International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3468791.3468808","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 7

摘要

链接数据的快速爆炸导致了许多结构薄弱和不完整的数据源，其中可能缺少键入信息。另一方面，类型信息对于诸如查询应答、集成、摘要和分区等许多任务是必不可少的。现有的类型发现方法，要么完全忽略数据集中可用的类型声明(隐式类型发现方法)，要么仅依赖现有类型来补充它们(显式类型充实方法)。隐式类型发现方法基于实例分组，这需要在实例之间进行详尽的比较。这个过程是昂贵的，而且不是增量的。另一方面，显式类型充实方法不能识别新类型，也不能处理只有很少或没有模式信息的数据源。在本文中，我们提出了第一个用于RDF数据集的增量和混合类型发现系统HInT，它支持在缺少类型声明的数据集中进行类型发现。为了实现这一目标，我们增量地识别各种实例的模式，对它们进行索引，然后对它们进行分组以识别类型。在处理实例的过程中，我们的方法利用它的类型信息(如果可用)，通过在正确的组中指导新实例的分类，并通过改进已经构建的组，来提高所发现类型的质量。我们的分析和实验表明，我们的方法在效率方面占主导地位，来自两个世界的竞争对手，隐式类型发现和显式类型丰富，而在大多数情况下，在质量方面优于它们。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources

The rapid explosion of linked data has resulted into many weakly structured and incomplete data sources, where typing information might be missing. On the other hand, type information is essential for a number of tasks such as query answering, integration, summarization and partitioning. Existing approaches for type discovery, either completely ignore type declarations available in the dataset (implicit type discovery approaches), or rely only on existing types, in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between the instances. This process is expensive and not incremental. Explicit type enrichment approaches on the other hand, are not able to identify new types and they can not process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, enabling type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the various instances, we index and then group them to identify the types. During the processing of an instance, our approach exploits its type information, if available, to improve the quality of the discovered types by guiding the classification of the new instance in the correct group and by refining the groups already built. We analytically and experimentally show that our approach dominates in terms of efficiency, competitors from both worlds, implicit type discovery and explicit type enrichment while outperforming them in most of the cases in terms of quality.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

33rd International Conference on Scientific and Statistical Database Management

自引率

0.00%

发文量