High performance data mining using the nearest neighbor join

2002 IEEE International Conference on Data Mining, 2002. Proceedings. Pub Date: 2002-12-09. DOI: 10.1109/ICDM.2002.1183884

C. Böhm, Florian Krebs


Citations: 46

Abstract

The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, where the user defines a distance threshold for the join, and the closest point query or k-distance join, which retrieves the k most similar pairs. In this paper, we investigate an important third similarity join operation, the k-nearest neighbor join, which combines each point of one point set with its k nearest neighbors in the other set. It has been shown that many standard algorithms of Knowledge Discovery in Databases (KDD), such as k-means and k-medoid clustering, nearest neighbor classification, data cleansing, and postprocessing of sampling-based data mining, can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of their results. We propose a new algorithm to compute the k-nearest neighbor join using the multipage index (MuX), a specialized index structure for the similarity join. To reduce both CPU and I/O cost, we develop optimal loading and processing strategies.
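To make the three join operations concrete, the following is a minimal brute-force sketch in Python. It is not the MuX-based algorithm proposed in the paper; it is a naive nested-loop baseline that only illustrates the semantics of each join. The function names (`distance_range_join`, `k_distance_join`, `knn_join`), the tuple representation of points, and the use of Euclidean distance are assumptions made for illustration.

```python
import heapq
import math


def distance_range_join(R, S, eps):
    """All pairs (r, s) whose distance is at most the threshold eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]


def k_distance_join(R, S, k):
    """The k closest pairs (r, s) over the whole cross product of R and S."""
    pairs = ((r, s) for r in R for s in S)
    return heapq.nsmallest(k, pairs, key=lambda p: math.dist(p[0], p[1]))


def knn_join(R, S, k):
    """For every point r in R, the k points of S nearest to r.

    Returns a dict mapping the index of r in R to its neighbor list,
    ordered from nearest to farthest.
    """
    return {
        i: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
        for i, r in enumerate(R)
    }


# Small hypothetical data set, chosen only to show the difference in results.
R = [(0.0, 0.0), (10.0, 10.0)]
S = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]

range_result = distance_range_join(R, S, eps=1.5)
closest_pairs = k_distance_join(R, S, k=2)
neighbors = knn_join(R, S, k=2)
```

Note the difference in result size: the distance range join returns however many pairs fall within the threshold, the k-distance join returns exactly k pairs in total, and the k-nn join returns k neighbors for each point of R (here 2 × 2 = 4 pairs). Every function in this sketch costs O(|R| · |S|) distance computations, which is the kind of cost an index-based approach aims to reduce.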

