δ-clusters: capturing subspace correlation in a large data set

Proceedings 18th International Conference on Data Engineering. Pub Date: 2002-08-07. DOI: 10.1109/ICDE.2002.994771

Jiong Yang, Wei Wang, Haixun Wang, Philip S. Yu


Citations: 364

Abstract

Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate for capturing the coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications, including bio-informatics analysis and collaborative filtering analysis, where the data may be incomplete and subject to biases. In bio-informatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. We introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce near-optimal clustering results. The δ-cluster model includes the bicluster model as a special case, and in that setting the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
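The abstract does not state how coherence is measured. As a rough illustration only, the sketch below assumes a residue-style score in the spirit of bicluster models: each specified entry is compared with its row mean, column mean, and overall mean, and missing entries (written as `None`) are skipped. The function name `mean_abs_residue` and the sample matrices are hypothetical and not taken from the paper.

```python
def mean_abs_residue(block):
    """Mean absolute residue over the specified (non-None) entries of a submatrix.

    Assumption: every row and every column has at least one specified entry.
    """
    rows, cols = len(block), len(block[0])

    def mean(values):
        # Average over specified values only; None marks a missing entry.
        present = [v for v in values if v is not None]
        return sum(present) / len(present)

    row_mean = [mean(block[i]) for i in range(rows)]
    col_mean = [mean([block[i][j] for i in range(rows)]) for j in range(cols)]
    all_mean = mean([v for row in block for v in row])

    # Residue of an entry: value - row mean - column mean + overall mean.
    residues = [
        abs(block[i][j] - row_mean[i] - col_mean[j] + all_mean)
        for i in range(rows)
        for j in range(cols)
        if block[i][j] is not None
    ]
    return sum(residues) / len(residues)


# Rows differ by a constant offset: very different values, same pattern.
shifted = [[1, 2, 3], [11, 12, 13], [101, 102, 103]]
# No shared pattern across rows.
scrambled = [[1, 9, 3], [11, 2, 13], [50, 102, 7]]
# Same shifted pattern with one unspecified entry.
with_gap = [[1, 2, 3], [11, None, 13], [101, 102, 103]]

for name, block in [("shifted", shifted), ("scrambled", scrambled), ("with_gap", with_gap)]:
    print(name, round(mean_abs_residue(block), 3))
```

Under this assumed score, the shifted rows come out as essentially perfectly coherent although their values differ by two orders of magnitude, while the scrambled matrix scores much higher. A missing entry changes the means it would have contributed to, so the score for `with_gap` is no longer exactly zero.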

