估计条件函数依赖的置信度

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data Pub Date : 2009-06-29 DOI:10.1145/1559845.1559895

Graham Cormode, Lukasz Golab, Flip Korn, A. Mcgregor, D. Srivastava, Xi Zhang

{"title":"估计条件函数依赖的置信度","authors":"Graham Cormode, Lukasz Golab, Flip Korn, A. Mcgregor, D. Srivastava, Xi Zhang","doi":"10.1145/1559845.1559895","DOIUrl":null,"url":null,"abstract":"Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD)gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also the harder case when this is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative errors guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"178 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"56","resultStr":"{\"title\":\"Estimating the confidence of conditional functional dependencies\",\"authors\":\"Graham Cormode, Lukasz Golab, Flip Korn, A. Mcgregor, D. Srivastava, Xi Zhang\",\"doi\":\"10.1145/1559845.1559895\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD)gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also the harder case when this is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative errors guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.\",\"PeriodicalId\":344093,\"journal\":{\"name\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"volume\":\"178 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-06-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"56\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1559845.1559895\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1559845.1559895","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 56

摘要

条件功能依赖(cfd)最近被提出作为经典功能依赖的扩展，应用于关系的特定子集，由模式表指定。计算CFD的支持度和置信度(即，适用子集的大小和它满足CFD的程度)提供了关于数据语义和数据质量的有价值的信息。虽然计算支持度更容易，但如果关系很大，精确计算置信度是昂贵的，并且从关系的随机样本中估计它是不可靠的，除非样本很大。我们研究了如何在很小的空间内，通过少量的输入通道(一次或两次)有效地估计CFD的置信度。我们的解决方案基于各种采样和素描技术，适用于提前知道模式表的情况，也适用于在看到数据后给出的更困难的情况。我们分析了我们的算法，并表明它们可以保证一个小的加性误差;我们还表明，相对误差保证是不可能的。我们通过使用真实和合成数据的详细研究，从经验上证明了这些方法的力量。这些实验表明，用比它们所代表的数据大小小得多的摘要来非常准确地估计CFD置信度是可能的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Estimating the confidence of conditional functional dependencies

Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD)gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also the harder case when this is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative errors guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

自引率

0.00%

发文量