An Improved Cost Function for Hierarchical Cluster Trees

JCR quartile: Q4 (Mathematics)
Dingkang Wang, Yusu Wang
DOI: 10.20382/jocg.v11i1a11 (https://doi.org/10.20382/jocg.v11i1a11)
Journal: International Journal of Computational Geometry & Applications, volume 40 1, pages 283-331
Publication date: 2018-12-06
Publication type: Journal Article
Citation count: 15

Abstract

Hierarchical clustering is a popular method in many data analysis applications. It partitions a data set into a hierarchical collection of clusters, and can provide a global view of the (cluster) structure behind the data across different levels of granularity. A hierarchical clustering (HC) of a data set can be naturally represented by a tree, called an HC-tree, where leaves correspond to input data and subtrees rooted at internal nodes correspond to clusters. Many hierarchical clustering algorithms used in practice are developed in a procedural manner. Dasgupta proposed to study the hierarchical clustering problem from an optimization point of view, and introduced an intuitive cost function for similarity-based hierarchical clustering with nice properties as well as natural approximation algorithms. We observe that while Dasgupta's cost function is effective at differentiating a good HC-tree from a bad one for a fixed graph, the value of this cost function does not reflect how consistent an input similarity graph is with a hierarchical structure. In this paper, we present a new cost function, developed from Dasgupta's cost function, to address this issue. The optimal tree under the new cost function remains the same as the one under Dasgupta's cost function. However, the value of our cost function is more meaningful. The new way of formulating the cost function also leads to a polynomial-time algorithm to compute the optimal cluster tree when the input graph has a perfect HC-structure, or an approximation algorithm when the input graph 'almost' has a perfect HC-structure. Finally, we provide further understanding of the new cost function by studying its behavior for random graphs sampled from an edge probability matrix.
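
As a rough illustration of the baseline the abstract builds on: Dasgupta's cost function charges every pair of leaves its similarity weight multiplied by the number of leaves under the pair's lowest common ancestor in the HC-tree, so a good tree separates heavy edges as low in the tree as possible. The sketch below is only an assumption-laden toy, not the authors' code. The nested-tuple tree encoding, the names `leaves` and `dasgupta_cost`, and the four-node path graph are all hypothetical, and the paper's new cost function is not reproduced here.

```python
from itertools import combinations

# Hypothetical encoding: a tree is either a leaf label (any non-tuple value)
# or a tuple of subtrees.

def leaves(tree):
    """Return the list of leaf labels under `tree`."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out

def dasgupta_cost(tree, weights):
    """Sum of w(u, v) * (#leaves under the lowest common ancestor of u and v).

    `weights` maps frozenset({u, v}) to a similarity; missing pairs count as 0.
    """
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    n_here = sum(len(ls) for ls in child_leaves)
    cost = 0.0
    # Pairs whose endpoints fall in different children are split at this node.
    for a, b in combinations(child_leaves, 2):
        for u in a:
            for v in b:
                cost += weights.get(frozenset((u, v)), 0.0) * n_here
    # Pairs inside a single child are charged further down the tree.
    for child in tree:
        cost += dasgupta_cost(child, weights)
    return cost

# Toy similarity graph (an assumption for illustration): the path a - b - c - d.
weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("c", "d")): 1.0,
}
good = (("a", "b"), ("c", "d"))  # keeps two of the three edges low in the tree
bad = (("a", "c"), ("b", "d"))   # splits every edge at the root

print(dasgupta_cost(good, weights))  # 8.0
print(dasgupta_cost(bad, weights))   # 12.0
```

On this toy graph the cost ranks the two trees as expected (8 versus 12), which matches the abstract's remark that the cost distinguishes good from bad trees for a fixed graph. The raw values by themselves do not say how close the graph is to having a hierarchical structure, which is the gap the new cost function is meant to address.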
Source journal
CiteScore: 0.80
Self-citation rate: 0.00%
Articles published: 4
Review time: >12 weeks
Journal description: The International Journal of Computational Geometry & Applications (IJCGA) is a quarterly journal devoted to the field of computational geometry within the framework of design and analysis of algorithms. Emphasis is placed on the computational aspects of geometric problems that arise in various fields of science and engineering including computer-aided geometry design (CAGD), computer graphics, constructive solid geometry (CSG), operations research, pattern recognition, robotics, solid modelling, VLSI routing/layout, and others. Research contributions ranging from theoretical results in algorithm design (sequential or parallel, probabilistic or randomized algorithms) to applications in the above-mentioned areas are welcome. Research findings or experiences in the implementations of geometric algorithms, such as numerical stability, and papers with a geometric flavour related to algorithms or the application areas of computational geometry are also welcome.