Exponentially Consistent Nonparametric Linkage-Based Clustering of Data Sequences

IF 5.8 | Zone 2 (Engineering & Technology) | JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Signal Processing | Pub Date: 2025-07-11 | DOI: 10.1109/TSP.2025.3588351

Bhupender Singh; Ananth Ram Rajagopalan; Srikrishna Bhashyam

Volume 73, pages 2819-2832. Full text: https://ieeexplore.ieee.org/document/11078848/

Citation count: 0

Abstract

In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from unknown distributions. The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_{L}$) is smaller than the minimum inter-cluster distance ($d_{H}$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_{I} < d_{H}$, where $d_{I}$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_{I} < d_{L}$ in general. Thus, our results show that SLINK is exponentially consistent for a larger class of problems than previously known. In our simulations, we also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
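The gap between the two assumptions can be illustrated on a toy example. The sketch below is not the paper's algorithm or experimental setup: it uses points on a line as hypothetical stand-ins for distributions, with absolute difference as the distance, and it assumes that the distance between two sub-clusters is the minimum pairwise distance between them (the single-linkage convention). All function names are illustrative.

```python
from itertools import combinations

def linkage_gap(dist, a, b):
    """Single-linkage distance between index sets a and b (minimum pairwise distance)."""
    return min(dist[i][j] for i in a for j in b)

def single_linkage(dist, k):
    """Repeatedly merge the two closest clusters until only k clusters remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_gap(dist, clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] |= clusters[b]
        del clusters[b]  # safe: combinations always yields a < b
    return sorted(sorted(c) for c in clusters)

def max_intra_distance(dist, clusters):
    """d_L: largest distance between two members of the same cluster."""
    return max(dist[i][j] for c in clusters for i, j in combinations(c, 2))

def min_inter_distance(dist, clusters):
    """d_H: smallest distance between members of different clusters."""
    return min(linkage_gap(dist, a, b) for a, b in combinations(clusters, 2))

def max_split_gap(dist, clusters):
    """d_I (assumed form): largest single-linkage gap over all two-way splits of a cluster."""
    best = 0
    for c in clusters:
        for r in range(1, len(c) // 2 + 1):
            for sub in combinations(c, r):
                rest = [i for i in c if i not in sub]
                best = max(best, linkage_gap(dist, sub, rest))
    return best

# Hypothetical toy data: two chain-shaped clusters on the real line.
points = [0, 1, 2, 3, 5, 6, 7, 8]
dist = [[abs(p - q) for q in points] for p in points]
true_clusters = [[0, 1, 2, 3], [4, 5, 6, 7]]

print("d_L =", max_intra_distance(dist, true_clusters))
print("d_H =", min_inter_distance(dist, true_clusters))
print("d_I =", max_split_gap(dist, true_clusters))
print("single linkage:", single_linkage(dist, 2))
```

In this layout $d_{L} = 3$ exceeds $d_{H} = 2$, so the earlier assumption $d_{L} < d_{H}$ fails, while $d_{I} = 1$ stays below $d_{H}$ and single linkage still returns the two true clusters. In the setting of the abstract the distances would instead be computed between empirical distributions of the data sequences, which this sketch does not attempt.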


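Source journal: IEEE Transactions on Signal Processing (Engineering & Technology; Engineering, Electrical & Electronic)

CiteScore: 11.20

Self-citation rate: 9.30%

Articles published: 310

Review time: 3.0 months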

About the journal: The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term “signal” includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.