Parameterized Complexity of Feature Selection for Categorical Data Clustering

IF 0.8 Q3 COMPUTER SCIENCE, THEORY & METHODS
Sayan Bandyapadhyay, F. Fomin, P. Golovach, Kirill Simonov
{"title":"Parameterized Complexity of Feature Selection for Categorical Data Clustering","authors":"Sayan Bandyapadhyay, F. Fomin, P. Golovach, Kirill Simonov","doi":"10.1145/3604797","DOIUrl":null,"url":null,"abstract":"We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m − ℓ relevant features such that the cost of any optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ0-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |Σ|) · mg(k, |Σ|) · n2 for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that besides Feature Selection, it encompasses many other fundamental problems regarding categorical data such as Robust Clustering, and Binary and Boolean Low-rank Matrix Approximation with Outliers. Thus as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.","PeriodicalId":44045,"journal":{"name":"ACM Transactions on Computation Theory","volume":null,"pages":null},"PeriodicalIF":0.8000,"publicationDate":"2021-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Computation Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3604797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 1

Abstract

We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to dimensionality reduction in practice, most known feature selection methods are heuristics. We study the following mathematical model. We assume that some inadvertent (or undesirable) features of the input data unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), a budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m − ℓ relevant features such that the cost of an optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ_0-distances) between the selected features of the elements of the cluster and its center, and the clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |Σ|) · m^{g(k, |Σ|)} · n^2 for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points can be clustered around centers satisfying specific constraints. Notably, besides Feature Selection, Constrained Clustering with Outliers encompasses many other fundamental problems on categorical data, such as Robust Clustering and Binary and Boolean Low-Rank Matrix Approximation with Outliers. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems as well. We also complement our algorithmic findings with complexity lower bounds.
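To make the cost objective concrete, here is a minimal brute-force sketch in Python. It is not the paper's FPT algorithm: it exhaustively tries every choice of m − ℓ features and every assignment of points to k clusters, so it is exponential and only meant for toy instances. The function names `cluster_cost` and `exists_feature_selection` are illustrative, not from the paper. The sketch relies on one fact implicit in the problem definition: for a fixed cluster under Hamming cost, the optimal center is the coordinate-wise mode of the cluster's points.

```python
from collections import Counter
from itertools import combinations, product

def cluster_cost(cluster, features):
    """Cost of one non-empty cluster over the selected features.

    For each selected feature, the best center value is the most frequent
    value in that coordinate, so the Hamming cost contributed by that
    feature is |cluster| minus the mode's frequency.
    """
    cost = 0
    for j in features:
        counts = Counter(point[j] for point in cluster)
        cost += len(cluster) - max(counts.values())
    return cost

def exists_feature_selection(points, k, ell, budget):
    """Exhaustively check whether some choice of m - ell features admits a
    k-clustering of total cost at most `budget`. Exponential in n and m;
    intended only to illustrate the objective, not as an efficient algorithm.
    """
    m = len(points[0])
    for features in combinations(range(m), m - ell):
        # Try every assignment of the n points to k cluster labels.
        for labels in product(range(k), repeat=len(points)):
            clusters = [[p for p, lab in zip(points, labels) if lab == c]
                        for c in range(k)]
            total = sum(cluster_cost(cl, features) for cl in clusters if cl)
            if total <= budget:
                return features  # a witnessing set of relevant features
    return None

# Toy instance: n = 4 points, m = 3 categorical features over Sigma = {0, 1, 2}.
# The third coordinate is noise; dropping it (ell = 1) leaves two clean
# clusters of cost 0, so budget B = 0 becomes achievable.
points = [(0, 0, 2), (0, 0, 1), (1, 1, 0), (1, 1, 2)]
print(exists_feature_selection(points, k=2, ell=1, budget=0))  # -> (0, 1)
```

In the toy instance, no single noisy coordinate can be clustered at cost 0 with k = 2, but after discarding feature 2 the points collapse onto two identical pairs, which is exactly the kind of improvement the ℓ deleted features are meant to buy.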
Source journal: ACM Transactions on Computation Theory (COMPUTER SCIENCE, THEORY & METHODS)
CiteScore: 2.30 · Self-citation rate: 0.00% · Articles published: 10