Parameterized Complexity of Feature Selection for Categorical Data Clustering

IF 0.8 Q3 COMPUTER SCIENCE, THEORY & METHODS
Sayan Bandyapadhyay, F. Fomin, P. Golovach, Kirill Simonov
{"title":"Parameterized Complexity of Feature Selection for Categorical Data Clustering","authors":"Sayan Bandyapadhyay, F. Fomin, P. Golovach, Kirill Simonov","doi":"10.1145/3604797","DOIUrl":null,"url":null,"abstract":"We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m − ℓ relevant features such that the cost of any optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ0-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |Σ|) · mg(k, |Σ|) · n2 for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that besides Feature Selection, it encompasses many other fundamental problems regarding categorical data such as Robust Clustering, and Binary and Boolean Low-rank Matrix Approximation with Outliers. Thus as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.","PeriodicalId":44045,"journal":{"name":"ACM Transactions on Computation Theory","volume":null,"pages":null},"PeriodicalIF":0.8000,"publicationDate":"2021-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Computation Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3604797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 1

Abstract

We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to dimensionality reduction in practice, most known feature selection methods are heuristics. We study the following mathematical model. We assume that some inadvertent (or undesirable) features of the input data unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), a budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m − ℓ relevant features such that the cost of an optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ_0-distances) between the selected features of the elements of the cluster and its center, and the clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |Σ|) · m^{g(k, |Σ|)} · n^2 for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points can be clustered around centers satisfying specific constraints. Notably, besides Feature Selection, Constrained Clustering with Outliers encompasses many other fundamental problems on categorical data, such as Robust Clustering and Binary and Boolean Low-Rank Matrix Approximation with Outliers. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems as well. We also complement our algorithmic findings with complexity lower bounds.
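To make the cost objective concrete, here is a minimal brute-force sketch in Python. It is not the paper's FPT algorithm: it exhaustively tries every choice of m − ℓ features and every assignment of points to k clusters, so it is exponential and only meant for toy instances. The function names `cluster_cost` and `exists_feature_selection` are illustrative, not from the paper. The sketch relies on one fact implicit in the problem definition: for a fixed cluster under Hamming cost, the optimal center is the coordinate-wise mode of the cluster's points.

```python
from collections import Counter
from itertools import combinations, product

def cluster_cost(cluster, features):
    """Cost of one non-empty cluster over the selected features.

    For each selected feature, the best center value is the most frequent
    value in that coordinate, so the Hamming cost contributed by that
    feature is |cluster| minus the mode's frequency.
    """
    cost = 0
    for j in features:
        counts = Counter(point[j] for point in cluster)
        cost += len(cluster) - max(counts.values())
    return cost

def exists_feature_selection(points, k, ell, budget):
    """Exhaustively check whether some choice of m - ell features admits a
    k-clustering of total cost at most `budget`. Exponential in n and m;
    intended only to illustrate the objective, not as an efficient algorithm.
    """
    m = len(points[0])
    for features in combinations(range(m), m - ell):
        # Try every assignment of the n points to k cluster labels.
        for labels in product(range(k), repeat=len(points)):
            clusters = [[p for p, lab in zip(points, labels) if lab == c]
                        for c in range(k)]
            total = sum(cluster_cost(cl, features) for cl in clusters if cl)
            if total <= budget:
                return features  # a witnessing set of relevant features
    return None

# Toy instance: n = 4 points, m = 3 categorical features over Sigma = {0, 1, 2}.
# The third coordinate is noise; dropping it (ell = 1) leaves two clean
# clusters of cost 0, so budget B = 0 becomes achievable.
points = [(0, 0, 2), (0, 0, 1), (1, 1, 0), (1, 1, 2)]
print(exists_feature_selection(points, k=2, ell=1, budget=0))  # -> (0, 1)
```

In the toy instance, no single noisy coordinate can be clustered at cost 0 with k = 2, but after discarding feature 2 the points collapse onto two identical pairs, which is exactly the kind of improvement the ℓ deleted features are meant to buy.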
Source journal: ACM Transactions on Computation Theory (COMPUTER SCIENCE, THEORY & METHODS)
CiteScore: 2.30 · Self-citation rate: 0.00% · Articles published: 10