Parameter-Free Determination of Distance Thresholds for Metric Distance Constraints

Shaoxu Song, Lei Chen, Hong Cheng
Published in: 2012 IEEE 28th International Conference on Data Engineering, 2012-04-01
DOI: 10.1109/ICDE.2012.46 (https://doi.org/10.1109/ICDE.2012.46)
Citations: 14

Abstract

The importance of introducing distance constraints to data dependencies, such as differential dependencies (DDs) [28], has recently been recognized. Metric distance constraints tolerate small variations, which enables them to apply to a wide range of data quality checking applications, such as detecting data violations. However, determining distance thresholds for metric distance constraints is non-trivial. It often relies on a truth data instance that embeds the distance constraints. To find useful distance threshold patterns from data, several statistical measures serve as guidelines and must be specified, e.g., support, confidence, and dependent quality. Unfortunately, given a data instance, users might not have any knowledge of the data distribution, so setting the right parameters is very challenging. In this paper, we study the determination of distance thresholds for metric distance constraints in a parameter-free style. Specifically, we compute an expected utility based on the statistical measures from the data. According to our analysis as well as experimental verification, distance threshold patterns with higher expected utility offer better usage in real applications, such as violation detection. We then develop efficient algorithms to determine the distance thresholds with the maximum expected utility. Finally, our extensive experimental evaluation demonstrates the effectiveness and efficiency of the proposed methods.
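To make the notions of distance threshold, support, confidence, and violation detection concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it does not compute the expected utility or search for thresholds. It assumes numeric attributes compared by absolute difference, a fixed threshold per attribute, and pair-based definitions of support (fraction of all tuple pairs within the thresholds on both sides) and confidence (fraction of pairs close on the left-hand side that are also close on the right-hand side). The function names, the `day`/`price` attributes, and the threshold values are hypothetical.

```python
from itertools import combinations


def within(t1, t2, attrs, thresholds):
    """True if the two tuples differ by at most the threshold on every attribute."""
    return all(abs(t1[a] - t2[a]) <= thresholds[a] for a in attrs)


def pair_statistics(rows, lhs, rhs, thresholds):
    """Scan all tuple pairs for a constraint lhs -> rhs under the given thresholds.

    Assumed (illustrative) definitions:
      support    = pairs close on lhs AND rhs / all pairs
      confidence = pairs close on lhs AND rhs / pairs close on lhs
    A violation is a pair that is close on lhs but not on rhs.
    """
    total = lhs_ok = both_ok = 0
    violations = []
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        total += 1
        if within(t1, t2, lhs, thresholds):
            lhs_ok += 1
            if within(t1, t2, rhs, thresholds):
                both_ok += 1
            else:
                violations.append((i, j))
    support = both_ok / total if total else 0.0
    confidence = both_ok / lhs_ok if lhs_ok else 0.0
    return support, confidence, violations


# Hypothetical data: prices recorded on nearby days should be similar.
rows = [
    {"day": 1, "price": 100},
    {"day": 2, "price": 104},
    {"day": 3, "price": 180},
    {"day": 10, "price": 110},
]
thresholds = {"day": 2, "price": 10}  # hand-picked, for illustration only

support, confidence, violations = pair_statistics(
    rows, lhs=["day"], rhs=["price"], thresholds=thresholds
)
print(support, confidence, violations)
```

In this toy instance, row 2 is flagged against rows 0 and 1 because its price differs sharply from records only a day or two away. Choosing `thresholds` by hand, as done here, is exactly the parameter-setting burden the paper aims to remove.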