A practical approach to novel class discovery in tabular data

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery Pub Date : 2024-05-31 DOI:10.1007/s10618-024-01025-y

Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine

{"title":"A practical approach to novel class discovery in tabular data","authors":"Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine","doi":"10.1007/s10618-024-01025-y","DOIUrl":null,"url":null,"abstract":"The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the k-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (k-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"123 1","pages":""},"PeriodicalIF":4.3000,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Mining and Knowledge Discovery","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10618-024-01025-y","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the k-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (k-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.

Abstract Image

查看原文本刊更多论文

在表格数据中发现新类别的实用方法

新类别发现（NCD）问题包括从已标注的已知类别集合中提取知识，以准确划分未标注的新类别集合。虽然 NCD 最近受到了社会各界的广泛关注，但它通常是在计算机视觉问题和不现实的条件下解决的。特别是，新类别的数量通常被假定为事先已知，其标签有时被用于调整超参数。依赖于这些假设的方法并不适用于现实世界中的场景。在这项工作中，我们将重点关注在没有关于新类别的先验知识的情况下，如何解决表格数据中的 NCD 问题。为此，我们建议调整 NCD 方法的超参数，方法是调整 k 折交叉验证过程，并在每个折中隐藏一些已知类别。由于我们发现超参数过多的方法很可能会过度拟合这些隐藏类别，因此我们定义了一个简单的深度 NCD 模型。该方法仅由 NCD 问题所需的基本要素组成，并在现实条件下表现出稳健的性能。此外，我们发现该方法的潜在空间可用于可靠地估计新类别的数量。此外，我们还调整了两种无监督聚类算法（k-means 和光谱聚类），以充分利用已知类别的知识。我们在 7 个表格数据集上进行了广泛的实验，证明了所提方法和超参数调整过程的有效性，并表明无需依赖新类别知识也能解决 NCD 问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Data Mining and Knowledge Discovery 工程技术-计算机：人工智能

CiteScore

10.40

自引率

4.20%

发文量

审稿时长

10 months

期刊介绍： Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.