Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection

IF 1.6 2区数学 Q2 COMPUTER SCIENCE, THEORY & METHODS

Statistics and Computing Pub Date : 2024-07-17 DOI:10.1007/s11222-024-10467-9

Louis Ohl, Pierre-Alexandre Mattei, Charles Bouveyron, Mickaël Leclercq, Arnaud Droit, Frédéric Precioso

{"title":"Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection","authors":"Louis Ohl, Pierre-Alexandre Mattei, Charles Bouveyron, Mickaël Leclercq, Arnaud Droit, Frédéric Precioso","doi":"10.1007/s11222-024-10467-9","DOIUrl":null,"url":null,"abstract":"<p>Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple <span>\\(\\ell _1\\)</span> penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a discriminative clustering model. We demonstrate the performances of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"50 1","pages":""},"PeriodicalIF":1.6000,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistics and Computing","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1007/s11222-024-10467-9","RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple \(\ell _1\) penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a discriminative clustering model. We demonstrate the performances of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.

Abstract Image

查看原文本刊更多论文

用于联合判别聚类和特征选择的互信息的稀疏和几何感知广义化

聚类中的特征选择是一项艰巨的任务，需要同时发现相关聚类以及与这些聚类相关的变量。特征选择算法通常基于模型，通过优化模型选择或对数据分布的强假设来实现，而我们引入了一种判别聚类模型，试图通过简单的（\ell _1\）惩罚来最大化互信息的几何感知广义化（称为 GEMINI）：稀疏 GEMINI。这种算法避免了组合特征子集探索的负担，可轻松扩展到高维数据和大量样本，同时只需设计一个判别聚类模型。我们在合成数据集和大规模数据集上演示了稀疏 GEMINI 的性能。我们的结果表明，稀疏 GEMINI 是一种有竞争力的算法，能够在不使用相关性标准或先验假设的情况下，选择与聚类相关的变量子集。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Statistics and Computing 数学-计算机：理论方法

CiteScore

3.20

自引率

4.50%

发文量

审稿时长

6-12 weeks

期刊介绍： Statistics and Computing is a bi-monthly refereed journal which publishes papers covering the range of the interface between the statistical and computing sciences. In particular, it addresses the use of statistical concepts in computing science, for example in machine learning, computer vision and data analytics, as well as the use of computers in data modelling, prediction and analysis. Specific topics which are covered include: techniques for evaluating analytically intractable problems such as bootstrap resampling, Markov chain Monte Carlo, sequential Monte Carlo, approximate Bayesian computation, search and optimization methods, stochastic simulation and Monte Carlo, graphics, computer environments, statistical approaches to software errors, information retrieval, machine learning, statistics of databases and database technology, huge data sets and big data analytics, computer algebra, graphical models, image processing, tomography, inverse problems and uncertainty quantification. In addition, the journal contains original research reports, authoritative review papers, discussed papers, and occasional special issues on particular topics or carrying proceedings of relevant conferences. Statistics and Computing also publishes book review and software review sections.