Probabilistic Similarity Query on Dimension Incomplete Data

2009 Ninth IEEE International Conference on Data Mining. Pub Date: 2009-12-06. DOI: 10.1109/ICDM.2009.72

Wei-min Cheng, Xiaoming Jin, Jian-Tao Sun


Citations: 4

Abstract

Retrieving similar data has attracted considerable research effort because of its importance in data mining, databases, and information retrieval. The problem becomes challenging when the data is incomplete. Previous research has treated data incompleteness as the case where the data values for some dimensions are unknown. However, in many practical applications (e.g., data collection by a sensor network in a harsh environment), not only the data values but also the dimension information itself may be missing, which makes most similarity query algorithms infeasible. In this work, we propose the novel problem of similarity query on dimension incomplete data and adopt a probabilistic framework to model it. Users give a distance threshold and a probability threshold to specify their retrieval requirements: the distance threshold specifies the allowed distance between the query and a data object, and the probability threshold requires that each retrieved result satisfy the distance condition with at least the given probability. Instead of enumerating all possible cases to recover the missing dimensions, we propose an efficient approach that speeds up retrieval by leveraging the inherent relations between the query and dimension incomplete data objects. During query processing, we estimate lower and upper bounds of the probability that a given data object satisfies the query, and use these bounds to filter irrelevant data objects efficiently. Furthermore, a probability triangle inequality is proposed to further speed up query processing. Experiments on real data sets verify that the proposed similarity query method is effective and efficient on dimension incomplete data.
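To make the filtering idea concrete, the following is a minimal sketch of an upper-bound filter, not the authors' algorithm. It assumes (none of this is stated in the abstract) that the order of the observed values is preserved, that every placement of the observed values among the query's dimensions is equally likely, and that the distance is Euclidean. The function names `observed_distance`, `probability_upper_bound`, and `can_prune` are hypothetical. The sketch deliberately enumerates placements, which is the cost the proposed approach is designed to avoid; it only illustrates why a bound on the probability is enough to discard an object.

```python
from itertools import combinations
from math import sqrt


def observed_distance(query, observed, positions):
    """Euclidean distance restricted to the dimensions in `positions`."""
    return sqrt(sum((query[p] - v) ** 2 for p, v in zip(positions, observed)))


def probability_upper_bound(query, observed, r):
    """Upper bound on Pr[dist(query, complete object) <= r].

    Assumptions (illustrative, not from the paper): dimension order is
    preserved, all placements are equally likely, distance is Euclidean.
    Missing dimensions can only add to the distance, so a placement whose
    observed-only distance already exceeds r can never satisfy the query.
    """
    m, d = len(query), len(observed)
    if d > m:
        raise ValueError("observed object has more dimensions than the query")
    placements = list(combinations(range(m), d))
    feasible = sum(
        1 for pos in placements if observed_distance(query, observed, pos) <= r
    )
    return feasible / len(placements)


def can_prune(query, observed, r, c):
    """True when the object cannot reach the probability threshold c."""
    return probability_upper_bound(query, observed, r) < c


if __name__ == "__main__":
    query = [1.0, 2.0, 3.0]   # complete query with m = 3 dimensions
    observed = [1.0, 3.0]     # data object that lost one unknown dimension
    print(probability_upper_bound(query, observed, r=0.5))
    print(can_prune(query, observed, r=0.5, c=0.5))
```

In this toy case only one of the three placements (values at dimensions 0 and 2) keeps the observed-only distance within `r = 0.5`, so the bound is 1/3 and the object can be discarded for any probability threshold above that without reasoning about the missing value at all.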


