Using category-based adherence to cluster market-basket data

2002 IEEE International Conference on Data Mining, 2002. Proceedings. Pub Date: 2002-12-09. DOI: 10.1109/ICDM.2002.1184000

Ching-Huang Yun, Kun-Ta Chuang, Ming-Syan Chen

Citations: 16

Abstract

Using category-based adherence to cluster market-basket data

We devise an efficient algorithm for clustering market-basket data. Unlike traditional data, market-basket data are known to be high-dimensional and sparse, and to contain massive outliers. Because they do not explicitly consider the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing only with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance for marketing strategies as well as for the representation of clustering results for market-basket data. In view of these features, we devise a measurement, called the category-based adherence, and use it to perform the clustering. The distance of an item to a given cluster is defined as the number of links between this item and its nearest large node in the taxonomy tree, where a large node is an item or a category node whose occurrence count exceeds a given threshold. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. With this measurement, we develop an efficient clustering algorithm, called algorithm CBA, for market-basket data, with the objective of minimizing the category-based adherence. A validation model based on information gain is also devised to assess the quality of clustering for market-basket data. Experimental results on both real and synthetic datasets show that, with the taxonomy information, algorithm CBA significantly outperforms prior works in both execution efficiency and clustering quality for market-basket data.
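
The two definitions in the abstract (item-to-cluster distance and category-based adherence) can be illustrated with a minimal Python sketch. This is not the authors' algorithm CBA, only a reading of the definitions under several assumptions that the abstract does not state: the toy taxonomy `PARENT`, the names `item_distance`, `adherence`, and `best_cluster`, the choice to search only the item and its ancestors for the nearest large node, and the penalty used when no large node is found are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "goods": None,
    "drinks": "goods",
    "snacks": "goods",
    "cola": "drinks",
    "tea": "drinks",
    "chips": "snacks",
}


def item_distance(item: str, large_nodes: Set[str]) -> int:
    """Number of links from `item` up to its nearest large node.

    Assumptions (not from the abstract): only the item itself and its
    ancestors are searched, and an item with no large node on its path
    receives a penalty of (links to the root + 1).
    """
    links = 0
    node: Optional[str] = item
    while node is not None:
        if node in large_nodes:
            return links
        node = PARENT[node]
        links += 1
    return links  # no large node found: links to root + 1


def adherence(transaction: List[str], large_nodes: Set[str]) -> float:
    """Average item distance of a transaction to one cluster."""
    if not transaction:
        raise ValueError("transaction must contain at least one item")
    total = sum(item_distance(item, large_nodes) for item in transaction)
    return total / len(transaction)


def best_cluster(transaction: List[str], clusters: Dict[str, Set[str]]) -> str:
    """Pick the cluster with the smallest adherence (assumed assignment rule)."""
    return min(clusters, key=lambda name: adherence(transaction, clusters[name]))


# Each cluster is represented here only by its set of large nodes, i.e. the
# nodes whose occurrence count in that cluster exceeds the threshold.
clusters = {
    "A": {"cola", "drinks"},
    "B": {"snacks"},
}

basket = ["cola", "tea"]
print(adherence(basket, clusters["A"]))  # cola is large (0), tea -> drinks (1)
print(best_cluster(basket, clusters))
```

In this sketch a smaller adherence value means the transaction sits closer to the cluster's large nodes, which matches the stated objective of minimizing the category-based adherence. How the large nodes are recomputed as transactions move between clusters is left out, since the abstract does not describe it.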
