Combining Semi-supervised Clustering and Classification Under a Generalized Framework

IF 1.9 · Zone 4 Computer Science · JCR Q2 Mathematics, Interdisciplinary Applications

Journal of Classification · Pub Date: 2024-08-13 · DOI: 10.1007/s00357-024-09489-9

Zhen Jiang, Lingyun Zhao, Yu Lu

Citations: 0

Abstract

Most machine learning algorithms need a sufficient amount of labeled data to train a reliable classifier. However, labeling data is often costly and time-consuming, whereas unlabeled data is usually readily available. Learning from both labeled and unlabeled data has therefore become an active research topic. Inspired by the co-training algorithm, we present a learning framework called CSCC, which combines semi-supervised clustering and classification to learn from both labeled and unlabeled data. Unlike existing co-training style methods, which construct diverse classifiers that learn from each other, CSCC leverages the diversity between semi-supervised clustering and classification models to achieve mutual enhancement. Existing classification algorithms can be easily adapted to CSCC, allowing them to generalize from only a few labeled examples. In particular, to bridge the gap between class information and clustering, we propose a semi-supervised hierarchical clustering algorithm that uses labeled data to guide the cluster-splitting process. Within the CSCC framework, we introduce two loss functions that supervise the iterative updating of the semi-supervised clustering model and the classification model, respectively. Extensive experiments on a variety of benchmark datasets show that CSCC outperforms other state-of-the-art methods.
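
The general idea of a clustering model and a classifier teaching each other can be illustrated with a toy loop. The sketch below is not the CSCC algorithm: the abstract does not specify the two loss functions or the hierarchical clustering procedure, so the code substitutes a seeded k-means style clusterer, a 1-nearest-neighbour classifier, and a simple agreement rule for promoting unlabeled points. All function names (`seeded_cluster_labels`, `knn_labels`, `co_train`) and the toy data are hypothetical.

```python
import math

def nearest(point, candidates):
    """Index of the candidate closest to `point` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

def seeded_cluster_labels(labeled, unlabeled, n_iter=10):
    """Stand-in clusterer (assumption): k-means with one centroid per class,
    initialised from the labeled points of that class."""
    classes = sorted({y for _, y in labeled})
    centroids = []
    for c in classes:
        pts = [x for x, y in labeled if y == c]
        centroids.append(tuple(sum(v) / len(pts) for v in zip(*pts)))
    assign = []
    for _ in range(n_iter):
        # Assign each unlabeled point to its nearest centroid.
        assign = [nearest(x, centroids) for x in unlabeled]
        for k, c in enumerate(classes):
            # Labeled points always stay in the cluster of their own class.
            pts = [x for x, y in labeled if y == c]
            pts += [x for x, a in zip(unlabeled, assign) if a == k]
            centroids[k] = tuple(sum(v) / len(pts) for v in zip(*pts))
    return [classes[a] for a in assign]

def knn_labels(labeled, unlabeled):
    """Stand-in classifier (assumption): 1-nearest-neighbour on the labeled pool."""
    xs = [x for x, _ in labeled]
    return [labeled[nearest(x, xs)][1] for x in unlabeled]

def co_train(labeled, unlabeled, max_rounds=5):
    """Promote unlabeled points on which both models agree, then repeat."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        if not unlabeled:
            break
        by_cluster = seeded_cluster_labels(labeled, unlabeled)
        by_classifier = knn_labels(labeled, unlabeled)
        agreed = [(x, a) for x, a, b in zip(unlabeled, by_cluster, by_classifier) if a == b]
        if not agreed:
            break
        labeled += agreed
        promoted = {x for x, _ in agreed}
        unlabeled = [x for x in unlabeled if x not in promoted]
    return labeled, unlabeled

# Toy data: two labeled points and four unlabeled ones in two well-separated groups.
labeled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(1.0, 0.5), (0.5, 1.5), (9.0, 9.5), (10.5, 9.0)]
final_labeled, remaining = co_train(labeled, unlabeled)
```

The agreement rule is only one possible way for the two models to exchange information; it is used here because it needs no confidence scores or tuning, which keeps the sketch short.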

Source journal

Journal of Classification (Mathematics: Interdisciplinary Applications)

CiteScore: 3.60

Self-citation rate: 5.00%

Review time: >12 weeks

Journal description: To publish original and valuable papers in the field of classification, numerical taxonomy, multidimensional scaling and other ordination techniques, clustering, tree structures and other network models (with somewhat less emphasis on principal components analysis, factor analysis, and discriminant analysis), as well as associated models and algorithms for fitting them. Articles will support advances in methodology while demonstrating compelling substantive applications. Comprehensive review articles are also acceptable. Contributions will represent disciplines such as statistics, psychology, biology, information retrieval, anthropology, archeology, astronomy, business, chemistry, computer science, economics, engineering, geography, geology, linguistics, marketing, mathematics, medicine, political science, psychiatry, sociology, and soil science.