Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-01-27 DOI:10.1145/3165713

M. Mountantonakis, Yannis Tzitzikas

Publication type: Journal Article. Volume: 124 1. Pages: 1 - 49.

Citations: 24

Abstract

Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the current Linked Open Data (LOD) cloud is. In this article, we focus on methods, supported by special indexes and algorithms, for performing measurements related to the connectivity of more than two datasets that are useful in various tasks including (a) Dataset Discovery and Selection; (b) Object Coreference, i.e., for obtaining complete information about a set of entities, including provenance information; (c) Data Quality Assessment and Improvement, i.e., for assessing the connectivity between any set of datasets and monitoring their evolution over time, as well as for estimating data veracity; (d) Dataset Visualizations; and various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naïve way, in this article, we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and, finally, (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of URIs of any set of datasets. For enhancing scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms, we evaluate the achieved speedup using either a single machine or a cluster of machines, and we provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion triples-sized) LOD cloud that have never been carried out so far.
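
To make the ideas in the abstract concrete, here is a minimal sketch of how a sameAs closure and an element index could be combined to count the entities shared by every subset of datasets. It is an illustration under stated assumptions, not the authors' implementation: the union-find structure, the function names (`build_element_index`, `common_elements`), the dataset identifiers, and the example URIs are all hypothetical. The subset enumeration is a simplified stand-in for the lattice-based incremental algorithms the abstract mentions, and the namespace-based prefix index is omitted.

```python
from collections import defaultdict
from itertools import combinations


class UnionFind:
    """Disjoint sets, used here to close owl:sameAs pairs symmetrically and transitively."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_element_index(datasets, same_as_pairs):
    """Map each equivalence class (by representative URI) to the datasets containing it."""
    uf = UnionFind()
    for a, b in same_as_pairs:
        uf.union(a, b)
    index = defaultdict(set)
    for ds_id, uris in datasets.items():
        for uri in uris:
            index[uf.find(uri)].add(ds_id)
    return index


def common_elements(index, min_size=2):
    """Count, for every subset of datasets, the equivalence classes present in all of them."""
    # Group index entries by their exact set of datasets first, so each
    # distinct set is expanded once rather than once per entity.
    direct = defaultdict(int)
    for ds_ids in index.values():
        direct[frozenset(ds_ids)] += 1
    result = defaultdict(int)
    for ds_set, count in direct.items():
        for k in range(min_size, len(ds_set) + 1):
            for subset in combinations(sorted(ds_set), k):
                result[frozenset(subset)] += count
    return dict(result)


# Hypothetical toy data: three datasets and two owl:sameAs statements.
datasets = {
    "D1": {"http://a.example/Aristotle", "http://a.example/Plato"},
    "D2": {"http://b.example/Aristotle", "http://b.example/Socrates"},
    "D3": {"http://c.example/Aristoteles", "http://a.example/Plato"},
}
same_as = [
    ("http://a.example/Aristotle", "http://b.example/Aristotle"),
    ("http://b.example/Aristotle", "http://c.example/Aristoteles"),
]

index = build_element_index(datasets, same_as)
counts = common_elements(index)
```

In this toy input, the three Aristotle URIs collapse into one equivalence class found in all three datasets, while Plato is shared only by D1 and D3. Grouping index entries by their exact dataset set before expanding subsets is the design choice that avoids intersecting URI sets separately for every combination of datasets.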

