EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies

2023 IEEE 39th International Conference on Data Engineering (ICDE). Pub Date: 2023-04-01. DOI: 10.1109/ICDE55515.2023.00220

Qiongqiong Lin, Yunfan Gu, Jing Sai, Jinfei Liu, Kui Ren, Li Xiong, Tianzhen Wang, Yanbei Pang, Sheng Wang, Feifei Li

Abstract

Functional dependencies (FDs) have been extensively employed in discovering inferential relationships in databases, which provide feasible approaches for many data mining tasks, such as data obfuscation, query optimization, and schema normalization. Since the explosive growth of data leads to a rapid increase of FDs on large datasets, existing algorithms that pay more attention to the exact FD discovery cannot extract FDs efficiently. To bridge this gap, we propose an Efficient double-cycle approximation of Functional Dependency (EulerFD) discovery algorithm, which ensures both efficiency and accuracy of FD discovery. EulerFD induces FDs from invalid ones as invalidating an FD only requires comparing and verifying some pairs of tuples (that violate the dependency) while validating an FD requires examining and verifying all tuples. Considering the abundant tuple pairs in large datasets, a novel sampling strategy is employed in EulerFD to quickly extract invalid FDs by revising the sampling range according to previous sampling results. Furthermore, EulerFD evaluates the stopping criteria in a double-cycle structure as feedback for further sampling. The sampling strategy and the double-cycle structure complement each other to achieve a more efficient sampling effect. Experimental results on real-world and synthetic datasets, especially the massive datasets from DMS of Alibaba Cloud, justify the design and verify the efficiency and effectiveness of the proposed EulerFD.
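
The asymmetry the abstract relies on (one violating tuple pair is enough to invalidate an FD, while validating it requires looking at every tuple) can be illustrated with a minimal Python sketch. This is not the EulerFD algorithm: the function names, the uniform random pair sampling, and the toy rows are assumptions made for illustration, and the adaptive sampling range and double-cycle stopping criteria described above are not modeled.

```python
import random


def violates(row_a, row_b, lhs, rhs):
    """True if the pair agrees on every LHS column but differs on the RHS column."""
    return all(row_a[c] == row_b[c] for c in lhs) and row_a[rhs] != row_b[rhs]


def holds_exactly(rows, lhs, rhs):
    """Exact validation of lhs -> rhs: every tuple has to be examined."""
    seen = {}  # maps each LHS value combination to the first RHS value seen for it
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def invalidated_by_sampling(rows, lhs, rhs, num_pairs, seed=0):
    """Approximate check: draw random tuple pairs and stop at the first violation.

    Returning False does NOT prove the FD is valid; it only means that no
    violating pair was found among the sampled pairs.
    """
    rng = random.Random(seed)
    for _ in range(num_pairs):
        row_a, row_b = rng.sample(rows, 2)
        if violates(row_a, row_b, lhs, rhs):
            return True
    return False


# Toy relation (hypothetical data): zip -> city holds, name -> city does not.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "94105", "city": "San Francisco", "name": "Ann"},
]

print(holds_exactly(rows, ["zip"], "city"))
print(holds_exactly(rows, ["name"], "city"))
print(invalidated_by_sampling(rows, ["name"], "city", num_pairs=200))
```

In this sketch an FD that survives sampling is only a candidate, which is why an approximate approach trades some accuracy for speed: the more pairs are sampled, the less likely an invalid FD is to slip through.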
