An efficient approximation scheme for data mining tasks

Proceedings 17th International Conference on Data Engineering. Pub Date: 2001-04-02. DOI: 10.1109/ICDE.2001.914858

G. Kollios, D. Gunopulos, Nick Koudas, Stefan Berchtold

Citations: 29

Abstract

We investigate the use of biased sampling according to the density of the dataset to speed up general data mining tasks, such as clustering and outlier detection, in large multidimensional datasets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the dataset. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic data sets and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
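The abstract does not state which density estimator is used or the exact form of the bias, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes local density is approximated by counting points per grid cell, and that a point's inclusion probability is proportional to that density raised to a tunable exponent. The function names, the `exponent` and `cell_width` parameters, and the grid-based estimator are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def grid_cell(point, cell_width):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(int(coord // cell_width) for coord in point)


def inclusion_probabilities(points, expected_size, exponent, cell_width):
    """Return one inclusion probability per point.

    Assumption (not from the abstract): local density is the number of
    points sharing a grid cell, and the probability is proportional to
    density ** exponent, scaled toward `expected_size` and clipped at 1.
    """
    counts = Counter(grid_cell(p, cell_width) for p in points)
    weights = [counts[grid_cell(p, cell_width)] ** exponent for p in points]
    total = sum(weights)
    return [min(1.0, expected_size * w / total) for w in weights]


def density_biased_sample(points, expected_size, exponent, cell_width, seed=0):
    """Keep each point independently with its density-dependent probability.

    The expected sample size equals `expected_size` only when no
    probability is clipped at 1; clipping makes it smaller.
    """
    rng = random.Random(seed)
    probs = inclusion_probabilities(points, expected_size, exponent, cell_width)
    return [p for p, prob in zip(points, probs) if rng.random() < prob]


if __name__ == "__main__":
    # Hypothetical data: one dense cluster of 90 points plus 10 isolated points.
    dense = [(0.1 + 0.001 * i, 0.1) for i in range(90)]
    sparse = [(10.0 + i, 10.0 + i) for i in range(10)]
    data = dense + sparse

    for a in (-1, 0, 1):
        sample = density_biased_sample(data, 20, a, cell_width=1.0)
        kept_sparse = sum(1 for p in sample if p in sparse)
        print(f"exponent={a}: kept {len(sample)} points, {kept_sparse} isolated")
```

In this sketch, an exponent of 0 gives every point the same probability, which matches simple random sampling. A positive exponent favors dense regions, which may suit clustering, while a negative exponent favors sparse regions, which may suit outlier detection. How the paper itself tunes the bias for each task is not recoverable from the abstract.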


