不精确功能依赖的增量发现

Journal of Data and Information Quality (JDIQ) Pub Date : 2020-10-15 DOI:10.1145/3397462

Loredana Caruccio, Stefano Cirillo

{"title":"不精确功能依赖的增量发现","authors":"Loredana Caruccio, Stefano Cirillo","doi":"10.1145/3397462","DOIUrl":null,"url":null,"abstract":"Functional dependencies (fds) are one of the metadata used to assess data quality and to perform data cleaning operations. However, to pursue robustness with respect to data errors, it has been necessary to devise imprecise versions of functional dependencies, yielding relaxed functional dependencies (rfds). Among them, there exists the class of rfds relaxing on the extent, i.e., those admitting the possibility that an fd holds on a subset of data. In the literature, several algorithms to automatically discover rfds from big data collections have been defined. They achieve good performances with respect to the inherent problem complexity. However, most of them are capable of discovering rfds only by batch processing the entire dataset. This is not suitable in the era of big data, where the size of a database instance can grow with high-velocity, and the insertion of new data can invalidate previously holding rfds. Thus, it is necessary to devise incremental discovery algorithms capable of updating the set of holding rfds upon data insertions, without processing the entire dataset. To this end, in this article we propose an incremental discovery algorithm for rfds relaxing on the extent. It manages the validation of candidate rfds and the generation of possibly new rfd candidates upon the insertion of the new tuples, while limiting the size of the overall search space. Experimental results show that the proposed algorithm achieves extremely good performances on real-world datasets.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"55 1","pages":"1 - 25"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"Incremental Discovery of Imprecise Functional Dependencies\",\"authors\":\"Loredana Caruccio, Stefano Cirillo\",\"doi\":\"10.1145/3397462\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Functional dependencies (fds) are one of the metadata used to assess data quality and to perform data cleaning operations. However, to pursue robustness with respect to data errors, it has been necessary to devise imprecise versions of functional dependencies, yielding relaxed functional dependencies (rfds). Among them, there exists the class of rfds relaxing on the extent, i.e., those admitting the possibility that an fd holds on a subset of data. In the literature, several algorithms to automatically discover rfds from big data collections have been defined. They achieve good performances with respect to the inherent problem complexity. However, most of them are capable of discovering rfds only by batch processing the entire dataset. This is not suitable in the era of big data, where the size of a database instance can grow with high-velocity, and the insertion of new data can invalidate previously holding rfds. Thus, it is necessary to devise incremental discovery algorithms capable of updating the set of holding rfds upon data insertions, without processing the entire dataset. To this end, in this article we propose an incremental discovery algorithm for rfds relaxing on the extent. It manages the validation of candidate rfds and the generation of possibly new rfd candidates upon the insertion of the new tuples, while limiting the size of the overall search space. Experimental results show that the proposed algorithm achieves extremely good performances on real-world datasets.\",\"PeriodicalId\":15582,\"journal\":{\"name\":\"Journal of Data and Information Quality (JDIQ)\",\"volume\":\"55 1\",\"pages\":\"1 - 25\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Data and Information Quality (JDIQ)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3397462\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3397462","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 23

摘要

功能依赖关系(fds)是用于评估数据质量和执行数据清理操作的元数据之一。然而，为了追求数据错误方面的健壮性，有必要设计不精确版本的功能依赖关系，从而产生宽松的功能依赖关系(rfds)。其中，存在一类在程度上放松的fd，即承认fd对数据子集持有的可能性。在文献中，已经定义了几种从大数据集合中自动发现rfds的算法。它们在固有问题复杂性方面取得了良好的性能。然而，它们中的大多数只能通过批处理整个数据集来发现rfds。这在大数据时代是不合适的，因为数据库实例的大小可以高速增长，而插入的新数据可能会使以前持有的数据无效。因此，有必要设计能够在数据插入时更新持有的rfd集的增量发现算法，而无需处理整个数据集。为此，在本文中，我们提出了一种基于程度放松的rfds增量发现算法。它管理候选rfd的验证，并在插入新元组时生成可能的新rfd候选，同时限制整个搜索空间的大小。实验结果表明，该算法在真实数据集上取得了非常好的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Incremental Discovery of Imprecise Functional Dependencies

Functional dependencies (fds) are one of the metadata used to assess data quality and to perform data cleaning operations. However, to pursue robustness with respect to data errors, it has been necessary to devise imprecise versions of functional dependencies, yielding relaxed functional dependencies (rfds). Among them, there exists the class of rfds relaxing on the extent, i.e., those admitting the possibility that an fd holds on a subset of data. In the literature, several algorithms to automatically discover rfds from big data collections have been defined. They achieve good performances with respect to the inherent problem complexity. However, most of them are capable of discovering rfds only by batch processing the entire dataset. This is not suitable in the era of big data, where the size of a database instance can grow with high-velocity, and the insertion of new data can invalidate previously holding rfds. Thus, it is necessary to devise incremental discovery algorithms capable of updating the set of holding rfds upon data insertions, without processing the entire dataset. To this end, in this article we propose an incremental discovery algorithm for rfds relaxing on the extent. It manages the validation of candidate rfds and the generation of possibly new rfd candidates upon the insertion of the new tuples, while limiting the size of the overall search space. Experimental results show that the proposed algorithm achieves extremely good performances on real-world datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Data and Information Quality (JDIQ)

自引率

0.00%

发文量