On the evaluation and optimization of LabeledPAM
Miriama Jánošová, Andreas Lang, Petra Budikova, Erich Schubert, Vlastislav Dohnal
Information Systems, Volume 135, Article 102580. Published 2025-07-22. DOI: 10.1016/j.is.2025.102580
https://www.sciencedirect.com/science/article/pii/S030643792500064X
Abstract
The analysis of complex and weakly labeled data is increasingly popular. Traditional unsupervised clustering aims to uncover interrelated sets of objects based on feature-based similarity. On complex multimedia data, this approach often reaches its limits because of the curse of dimensionality. Semi-supervised clustering, which leverages small amounts of labeled data, has the potential to cope with this problem.
In this work, we delve into LabeledPAM, a semi-supervised clustering method, which extends FasterPAM, a state-of-the-art k-medoids clustering algorithm. Our algorithm is designed for both semi-supervised classification, where labels are assigned to clusters with minimal labeled data, and semi-supervised clustering, where new clusters with unknown labels are identified. We propose an optimization to the original LabeledPAM algorithm that reduces its computational complexity. Additionally, we provide an implementation in Rust, which integrates seamlessly with Python libraries.
To assess LabeledPAM’s performance, we empirically evaluate its properties by comparing it against a range of semi-supervised clustering algorithms, including density-based ones. We conduct experiments on a collection of real-world datasets. Our results demonstrate that LabeledPAM achieves competitive clustering quality while maintaining efficiency across various scenarios, showing its versatility for real-world applications.
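For readers unfamiliar with k-medoids, the following is a minimal sketch of the objective that PAM-style algorithms minimize (the sum of each object's distance to its nearest medoid) together with a naive swap search. It is an illustration only: it is neither FasterPAM nor LabeledPAM, it uses no labels, and the function names `total_deviation` and `naive_pam` are hypothetical, not taken from the paper or its Rust implementation.

```python
def total_deviation(dist, medoids):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def naive_pam(dist, k):
    """Naive swap search on a precomputed distance matrix.

    Starts from the first k objects as medoids (an arbitrary choice made
    for this sketch) and accepts any (medoid, non-medoid) swap that lowers
    the total deviation, until no swap improves it. The result is a local
    optimum, not necessarily the global one.
    """
    n = len(dist)
    medoids = list(range(k))
    best = total_deviation(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Replace the medoid at position `pos` with the candidate.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_deviation(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return sorted(medoids), best


# Toy data: six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for a in points] for b in points]
print(naive_pam(dist, 2))  # ([1, 4], 4)
```

Each candidate swap here recomputes the full objective, which is deliberately simple and slow; reducing the cost of evaluating swaps is the kind of optimization that faster PAM variants address.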
About the journal:
Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems.
Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers on massively parallel data management, fault tolerance in practice, and special-purpose hardware for data-intensive systems are also welcome, as are manuscripts from application domains such as urban informatics, social and natural science, and the Internet of Things. All papers should highlight innovative solutions to data management problems, such as new data models or performance enhancements, and show how those innovations contribute to the goals of the application.