Impact of regularization on spectral clustering

Antony Joseph, Bin Yu
{"title":"Impact of regularization on spectral clustering","authors":"Antony Joseph, Bin Yu","doi":"10.1109/ITA.2014.6804241","DOIUrl":null,"url":null,"abstract":"Summary form only given. Clustering in networks/graphs is an important problem with applications in the analysis of gene-gene interactions, social networks, text mining, to name a few. Spectral clustering is one of the more popular techniques for such purposes, chiefly due to its computational advantage and generality of application. The algorithm's generality arises from the fact that it is not tied to any modeling assumptions on the data, but is rooted in intuitive measures of community structure such as sparsest cut based measures (Hagen and Kahng (1992), Shi and Malik (2000), Ng. et. al (2002)).Here, we attempt to understand regularized form of spectral clustering. Our motivation for this work was empirical results in Amini et. al (2013) that showed that the performance of spectral clustering can greatly be improved via regularization. Here regularization entails adding a constant matrix to the adjacency matrix and calculating the corresponding Laplacian matrix. The value of the constant is called the regularization parameter. Our analysis is carried out under the stochastic block model (SBM) framework. Under the (SBM) (and its extensions). Previous results on spectral clustering (McSherry (2001), Dasgupta et. al. (2004), Rohe et. al (2011)) also assumed the SBM and relied on the minimum degree of the graph being sufficiently large to prove its good performance. By analyzing the spectrum of the Laplacian of an SBM as a function of the regularization parameter, we provide bounds for the perturbation of the regularized eigenvectors, which, in some situations, does not depend on the minimum degree. For example, in the two block SBM, our bounds depend inversely on the maximum degree, as opposed to the minimum degree. More importantly, we show the usefulness of regularization in the important practical situation where not all nodes can be clustered accurately. In such situations, in the absence of regularization, the top eigenvectors need not discriminate between the nodes which do belong to well-defined clusters. With a proper choice of regularization parameter, we demonstrate that top eigenvectors indeed discriminate between the well-defined clusters. A crucial ingredient in the above is the analysis of the spectrum of the Laplacian as a function of the regularization parameter. Assuming that there are K clusters, an adequate gap between the top K eigenvalues and the remaining eigenvalues, ensures that these clusters can be estimated well. Such a gap is commonly referred to as the eigen gap. In the situation considered in above paragraph, an adequate eigen gap may not exist for the unregularized Laplacian. We show that regularization works by creating a gap, allowing us to recover the clusters. As an important application of our bounds, we propose a data-driven technique DK-est (standing for estimated Davis-Kahn bounds) for choosing the regularization parameter. 
DK-est is shown to perform very well for simulated and real data sets.","PeriodicalId":338302,"journal":{"name":"2014 Information Theory and Applications Workshop (ITA)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"151","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 Information Theory and Applications Workshop (ITA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ITA.2014.6804241","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 151

Abstract

Summary form only given. Clustering in networks/graphs is an important problem with applications in the analysis of gene-gene interactions, social networks, and text mining, to name a few. Spectral clustering is one of the more popular techniques for such purposes, chiefly due to its computational advantage and generality of application. The algorithm's generality arises from the fact that it is not tied to any modeling assumptions on the data, but is rooted in intuitive measures of community structure such as sparsest-cut-based measures (Hagen and Kahng (1992), Shi and Malik (2000), Ng et al. (2002)).

Here, we attempt to understand the regularized form of spectral clustering. Our motivation for this work was the empirical results in Amini et al. (2013), which showed that the performance of spectral clustering can be greatly improved via regularization. Here, regularization entails adding a constant matrix to the adjacency matrix and calculating the corresponding Laplacian matrix; the value of the constant is called the regularization parameter. Our analysis is carried out under the stochastic block model (SBM) framework and its extensions. Previous results on spectral clustering (McSherry (2001), Dasgupta et al. (2004), Rohe et al. (2011)) also assumed the SBM and relied on the minimum degree of the graph being sufficiently large to prove its good performance. By analyzing the spectrum of the Laplacian of an SBM as a function of the regularization parameter, we provide bounds for the perturbation of the regularized eigenvectors which, in some situations, do not depend on the minimum degree. For example, in the two-block SBM, our bounds depend inversely on the maximum degree, as opposed to the minimum degree.

More importantly, we show the usefulness of regularization in the important practical situation where not all nodes can be clustered accurately. In such situations, in the absence of regularization, the top eigenvectors need not discriminate between the nodes that do belong to well-defined clusters. With a proper choice of regularization parameter, we demonstrate that the top eigenvectors do indeed discriminate between the well-defined clusters. A crucial ingredient in the above is the analysis of the spectrum of the Laplacian as a function of the regularization parameter. Assuming that there are K clusters, an adequate gap between the top K eigenvalues and the remaining eigenvalues ensures that these clusters can be estimated well. Such a gap is commonly referred to as the eigengap. In the situation considered in the above paragraph, an adequate eigengap may not exist for the unregularized Laplacian. We show that regularization works by creating such a gap, allowing us to recover the clusters.

As an important application of our bounds, we propose a data-driven technique, DK-est (standing for estimated Davis-Kahan bounds), for choosing the regularization parameter. DK-est is shown to perform very well on simulated and real data sets.
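The mechanics described in the abstract can be illustrated with a short simulation. The sketch below is not the authors' code and does not implement DK-est; it assumes a two-block SBM with hypothetical parameters (n, p_in, p_out), adopts one common convention for the regularization (adding tau/n to every entry of the adjacency matrix before forming the normalized graph operator), and simply prints the eigengap and clustering accuracy as the regularization parameter tau varies.

```python
# Minimal sketch: regularized spectral clustering under a two-block SBM.
# Parameter names (n, p_in, p_out, tau) are illustrative, not from the paper.
import numpy as np
from numpy.linalg import eigh
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def sample_sbm(n, p_in, p_out):
    """Sample a symmetric adjacency matrix from a two-block SBM with
    two equal-sized communities (within-block prob p_in, between p_out)."""
    labels = np.repeat([0, 1], n // 2)
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = rng.binomial(1, np.triu(P, k=1))   # sample each edge once (upper triangle)
    return upper + upper.T, labels

def regularized_spectral_clustering(A, k, tau):
    """Add tau/n to every entry of A (the 'constant matrix'), form the symmetric
    normalized operator D^{-1/2} A D^{-1/2} of the regularized matrix, and run
    k-means on the eigenvectors of its k largest eigenvalues."""
    n = A.shape[0]
    A_tau = A + tau / n                                  # regularization by a constant matrix
    d_tau = np.maximum(A_tau.sum(axis=1), 1e-12)         # guard against isolated nodes when tau = 0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tau))
    L_tau = D_inv_sqrt @ A_tau @ D_inv_sqrt
    eigvals, eigvecs = eigh(L_tau)                       # eigenvalues in ascending order
    top = eigvecs[:, -k:]                                # eigenvectors of the k largest eigenvalues
    eigengap = eigvals[-k] - eigvals[-(k + 1)]           # gap between k-th and (k+1)-th largest
    assignments = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(top)
    return assignments, eigengap

A, truth = sample_sbm(n=400, p_in=0.08, p_out=0.02)
for tau in [0.0, 1.0, 5.0, 25.0]:
    est, gap = regularized_spectral_clustering(A, k=2, tau=tau)
    acc = max(np.mean(est == truth), np.mean(est != truth))  # cluster labels are only defined up to permutation
    print(f"tau = {tau:5.1f}   eigengap = {gap:.3f}   accuracy = {acc:.3f}")
```

Scanning the printed eigengap across values of tau mirrors, informally, the role the eigengap plays in the paper's analysis: when the gap between the K-th and (K+1)-th largest eigenvalues is too small, the top eigenvectors need not separate the well-defined clusters, and a suitable regularization parameter can open that gap.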