Content-based Union and Complement Metrics for Dataset Search over RDF Knowledge Graphs

M. Mountantonakis, Yannis Tzitzikas
{"title":"Content-based Union and Complement Metrics for Dataset Search over RDF Knowledge Graphs","authors":"M. Mountantonakis, Yannis Tzitzikas","doi":"10.1145/3372750","DOIUrl":null,"url":null,"abstract":"RDF Knowledge Graphs (or Datasets) contain valuable information that can be exploited for a variety of real-world tasks. However, due to the enormous size of the available RDF datasets, it is difficult to discover the most valuable datasets for a given task. For improving dataset Discoverability, Interlinking, and Reusability, there is a trend for Dataset Search systems. Such systems are mainly based on metadata and ignore the contents; however, in tasks related to data integration and enrichment, the contents of datasets have to be considered. This is important for data integration but also for data enrichment, for instance, quite often datasets’ owners want to enrich the content of their dataset, by selecting datasets that provide complementary information for their dataset. The above tasks require content-based union and complement metrics between any subset of datasets; however, there is a lack of such approaches. For making feasible the computation of such metrics at very large scale, we propose an approach relying on (a) a set of pre-constructed (and periodically refreshed) semantics-aware indexes, and (b) “lattice-based” incremental algorithms that exploit the posting lists of such indexes, as well as set theory properties, for enabling efficient responses at query time. Finally, we discuss the efficiency of the proposed methods by presenting comparative results, and we report measurements for 400 real RDF datasets (containing over 2 billion triples), by exploiting the proposed metrics.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"15 1","pages":"1 - 31"},"PeriodicalIF":0.0000,"publicationDate":"2020-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3372750","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23

Abstract

RDF Knowledge Graphs (or Datasets) contain valuable information that can be exploited for a variety of real-world tasks. However, due to the enormous size of the available RDF datasets, it is difficult to discover the most valuable datasets for a given task. For improving dataset Discoverability, Interlinking, and Reusability, there is a trend for Dataset Search systems. Such systems are mainly based on metadata and ignore the contents; however, in tasks related to data integration and enrichment, the contents of datasets have to be considered. This is important for data integration but also for data enrichment, for instance, quite often datasets’ owners want to enrich the content of their dataset, by selecting datasets that provide complementary information for their dataset. The above tasks require content-based union and complement metrics between any subset of datasets; however, there is a lack of such approaches. For making feasible the computation of such metrics at very large scale, we propose an approach relying on (a) a set of pre-constructed (and periodically refreshed) semantics-aware indexes, and (b) “lattice-based” incremental algorithms that exploit the posting lists of such indexes, as well as set theory properties, for enabling efficient responses at query time. Finally, we discuss the efficiency of the proposed methods by presenting comparative results, and we report measurements for 400 real RDF datasets (containing over 2 billion triples), by exploiting the proposed metrics.
RDF知识图上数据集搜索的基于内容的联合和补充度量
RDF知识图(或数据集)包含有价值的信息,可以用于各种现实世界的任务。然而,由于可用RDF数据集的庞大规模,很难为给定任务发现最有价值的数据集。为了提高数据集的可发现性、互联性和可重用性,数据集搜索系统是一个发展趋势。这类系统主要基于元数据,忽略了内容;然而,在与数据集成和丰富相关的任务中,必须考虑数据集的内容。这对数据集成很重要,但对数据丰富也很重要,例如,数据集的所有者经常希望通过选择为其数据集提供补充信息的数据集来丰富其数据集的内容。上述任务需要在任何数据集子集之间基于内容的联合和补充度量;然而,缺乏这样的方法。为了在非常大的规模上实现这些指标的计算,我们提出了一种方法,它依赖于(a)一组预先构建(并定期刷新)的语义感知索引,以及(b)“基于格子的”增量算法,该算法利用这些索引的发布列表以及集合理论属性,以便在查询时实现有效的响应。最后,我们通过提供比较结果来讨论所建议方法的效率,并通过利用所建议的度量报告400个真实RDF数据集(包含超过20亿个三元组)的测量结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信