CDEC: a constrained deep embedded clustering

Elham Amirizadeh, R. Boostani
{"title":"CDEC: a constrained deep embedded clustering","authors":"Elham Amirizadeh, R. Boostani","doi":"10.1108/ijicc-03-2021-0053","DOIUrl":null,"url":null,"abstract":"PurposeThe aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; also, the authors show that applying this information improves the performance of clustering and also increase the speed of the network training convergence.Design/methodology/approachIn data mining, semisupervised learning is an interesting approach because good performance can be achieved with a small subset of labeled data; one reason is that the data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering; this type of learning does not use class labels for clustering. Instead, it uses information of some pairs of instances (side information), and these instances maybe are in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering was studied extensively; however, little works have focused on constrained clustering for big datasets. In this paper, the authors have presented a constrained clustering for big datasets, and the method uses a DNN. The authors inject the constraints (ML and CL) to this DNN to promote the clustering performance and call it constrained deep embedded clustering (CDEC). In this manner, an autoencoder was implemented to elicit informative low dimensional features in the latent space and then retrain the encoder network using a proposed Kullback–Leibler divergence objective function, which captures the constraints in order to cluster the projected samples. The proposed CDEC has been compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means was applied to the known MNIST, Reuters-10k and USPS datasets, and their performance were assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC in terms of clustering accuracy to the counterparts.FindingsFirst of all, this is the first DNN-constrained clustering that uses side information to improve the performance of clustering without using labels in big datasets with high dimension. Second, the author defined a formula to inject side information to the DNN. Third, the proposed method improves clustering performance and network convergence speed.Originality/valueLittle works have focused on constrained clustering for big datasets; also, the studies in DNNs for clustering, with specific loss function that simultaneously extract features and clustering the data, are rare. The method improves the performance of big data clustering without using labels, and it is important because the data labeling is expensive and time-consuming, especially for big datasets.","PeriodicalId":352072,"journal":{"name":"Int. J. Intell. Comput. Cybern.","volume":"310 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Int. J. Intell. Comput. 
Cybern.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1108/ijicc-03-2021-0053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Purpose
The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results on big datasets. The authors also show that applying this information improves clustering performance and speeds up the convergence of network training.

Design/methodology/approach
In data mining, semisupervised learning is an attractive approach because good performance can be achieved with a small subset of labeled data; data labeling is expensive, and semisupervised learning does not require all labels. One type of semisupervised learning is constrained clustering, which does not use class labels. Instead, it uses side information about pairs of instances: a pair may belong to the same cluster (must-link [ML]) or to different clusters (cannot-link [CL]). Constrained clustering has been studied extensively, but few works have addressed constrained clustering for big datasets. This paper presents a constrained clustering method for big datasets based on a DNN. The authors inject the ML and CL constraints into the DNN to improve clustering performance and call the result constrained deep embedded clustering (CDEC). An autoencoder is first trained to extract informative low-dimensional features in the latent space; the encoder is then retrained with a proposed Kullback–Leibler (KL) divergence objective function that captures the constraints in order to cluster the projected samples. CDEC was compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means on the well-known MNIST, Reuters-10k and USPS datasets, with performance assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC over these counterparts in clustering accuracy.

Findings
First, this is the first DNN-based constrained clustering method that uses side information, rather than labels, to improve clustering performance on big, high-dimensional datasets. Second, the authors define a formula for injecting side information into the DNN. Third, the proposed method improves both clustering performance and network convergence speed.

Originality/value
Few works have focused on constrained clustering for big datasets, and studies of DNNs for clustering with a dedicated loss function that simultaneously extracts features and clusters the data are rare. The method improves big-data clustering performance without using labels, which is important because data labeling is expensive and time-consuming, especially for big datasets.
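The abstract describes the training recipe only at a high level (autoencoder pretraining, then retraining the encoder with a KL divergence objective that captures the ML/CL constraints) and does not give the constraint formula itself. The sketch below combines the standard deep embedded clustering (DEC) machinery, Student's t soft assignments and a KL(P || Q) loss, with an illustrative pairwise penalty for must-link and cannot-link pairs. The names `soft_assign`, `cdec_loss` and the weight `lam`, as well as the exact form of the penalty, are assumptions for illustration, not the paper's objective.

```python
# Minimal PyTorch sketch: DEC-style clustering loss with illustrative
# must-link / cannot-link penalties (NOT the exact CDEC formula, which
# the abstract does not specify).
import torch
import torch.nn.functional as F

def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij between embeddings z (n, d)
    and cluster centroids (k, d), as in standard DEC."""
    dist2 = torch.cdist(z, centroids).pow(2)            # (n, k) squared distances
    q = (1.0 + dist2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target distribution p_ij from DEC."""
    weight = q.pow(2) / q.sum(dim=0)
    return (weight.t() / weight.sum(dim=1)).t()

def cdec_loss(q, p, ml_pairs, cl_pairs, lam=0.1):
    """KL(P || Q) clustering loss plus pairwise constraint penalties:
    must-link pairs should share a cluster, cannot-link pairs should not."""
    kl = F.kl_div(q.log(), p, reduction="batchmean")
    penalty = q.new_zeros(())
    for i, j in ml_pairs:   # prob. of same cluster should be high
        penalty = penalty - (q[i] * q[j]).sum().log()
    for i, j in cl_pairs:   # prob. of same cluster should be low
        penalty = penalty - (1.0 - (q[i] * q[j]).sum()).log()
    return kl + lam * penalty

# Toy usage: 6 embedded points, 2 clusters, one constraint of each kind.
z = torch.randn(6, 4, requires_grad=True)               # stands in for encoder output
centroids = torch.randn(2, 4, requires_grad=True)
q = soft_assign(z, centroids)
loss = cdec_loss(q, target_distribution(q).detach(),
                 ml_pairs=[(0, 1)], cl_pairs=[(0, 5)])
loss.backward()          # gradients flow back to encoder output and centroids
```

In a full pipeline, `z` would come from the pretrained encoder and the loss would be minimized over both the encoder weights and the centroids; the weight `lam` trades off the clustering objective against constraint satisfaction.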