On the evaluation and optimization of LabeledPAM

IF 3.4, Zone 2 (Computer Science), Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems. Pub Date: 2025-07-22. DOI: 10.1016/j.is.2025.102580

Miriama Jánošová, Andreas Lang, Petra Budikova, Erich Schubert, Vlastislav Dohnal

Information Systems, Volume 135, Article 102580. Journal Article. https://www.sciencedirect.com/science/article/pii/S030643792500064X

Citations: 0

Abstract

The analysis of complex and weakly labeled data is increasingly popular. Traditional unsupervised clustering aims to uncover interrelated sets of objects based on feature-based similarity. This approach often reaches its limits when dealing with complex multimedia data due to the curse of dimensionality, presenting unique challenges. Semi-supervised clustering, which leverages small amounts of labeled data, has the potential to cope with this problem.

In this work, we delve into LabeledPAM, a semi-supervised clustering method, which extends FasterPAM, a state-of-the-art k-medoids clustering algorithm. Our algorithm is designed for both semi-supervised classification, where labels are assigned to clusters with minimal labeled data, and semi-supervised clustering, where new clusters with unknown labels are identified. We propose an optimization to the original LabeledPAM algorithm that reduces its computational complexity. Additionally, we provide an implementation in Rust, which integrates seamlessly with Python libraries.
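The two use cases named in this paragraph can be illustrated with a toy sketch. It is not the LabeledPAM algorithm, whose details the abstract does not give: it runs a naive PAM-style swap search for k-medoids on an unlabeled distance matrix and only afterwards assigns each cluster the majority label of its few labeled members, leaving clusters without labeled members as "unknown". All function names, the toy data, and the labels "cat" and "dog" are assumptions made for illustration.

```python
from collections import Counter

def total_deviation(dist, medoids):
    """k-medoids loss: sum of distances from each point to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def naive_pam(dist, k):
    """Plain swap search (assumption: a naive stand-in, NOT FasterPAM or LabeledPAM)."""
    n = len(dist)
    medoids = list(range(k))  # simplistic initialization: the first k points
    best = total_deviation(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Try replacing the medoid at `pos` with the candidate point.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                loss = total_deviation(dist, trial)
                if loss < best:
                    medoids, best, improved = trial, loss, True
    return medoids, best

def assign(dist, medoids):
    """For every point, the position (in `medoids`) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def label_clusters(assignment, known_labels, k):
    """Majority label among labeled members; None marks a cluster with unknown label."""
    votes = [Counter() for _ in range(k)]
    for i, lab in known_labels.items():
        votes[assignment[i]][lab] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Toy 1-D data: three well-separated groups, only two points carry labels.
points = [0, 2, 4, 50, 51, 53, 100, 102]
dist = [[abs(a - b) for b in points] for a in points]
known_labels = {1: "cat", 4: "dog"}  # hypothetical labels

medoids, loss = naive_pam(dist, k=3)
assignment = assign(dist, medoids)
cluster_labels = label_clusters(assignment, known_labels, k=3)
predicted = [cluster_labels[c] for c in assignment]
print(sorted(medoids), loss, predicted)
```

In this toy run the third group contains no labeled point, so its members receive `None`; that corresponds to the "new cluster with unknown label" case, while the other two groups show label assignment from a single labeled example each.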

To assess LabeledPAM’s performance, we empirically evaluate its properties by comparing it against a range of semi-supervised clustering algorithms, including density-based ones. We conduct experiments on a collection of real-world datasets. Our results demonstrate that LabeledPAM achieves competitive clustering quality while maintaining efficiency across various scenarios, showing its versatility for real-world applications.


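Source journal

Information Systems (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 9.40
Self-citation rate: 2.70%
Articles published: 112
Review time: 53 days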

About the journal: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models or performance enhancements, and show how those innovations contribute to the goals of the application.