Exponentially Consistent Nonparametric Linkage-Based Clustering of Data Sequences
Authors: Bhupender Singh, Ananth Ram Rajagopalan, Srikrishna Bhashyam
Journal: IEEE Transactions on Signal Processing, vol. 73, pp. 2819-2832 (published 2025-07-11)
DOI: 10.1109/TSP.2025.3588351
URL: https://ieeexplore.ieee.org/document/11078848/
Journal metrics: impact factor 5.8; JCR Q1 (Engineering, Electrical & Electronic)
Citation count: 0
Abstract
In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data sequences generated from unknown distributions. The distributions of the $M$ data sequences belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, such as single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_{L}$) is smaller than the minimum inter-cluster distance ($d_{H}$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_{I} < d_{H}$, where $d_{I}$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_{I} < d_{L}$ in general. Thus, our results show that SLINK is exponentially consistent for a larger class of problems than previously known. In our simulations, we also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK, and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
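To make the three distances concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the pairwise distances between distributions are already known (in practice they would be estimated from the data sequences), uses points on a line as hypothetical stand-ins for distributions, and takes the distance between two sub-clusters to be the smallest pairwise distance between them, which is an assumption chosen to match single linkage. All function and variable names are illustrative.

```python
from itertools import combinations


def set_dist(dist, a, b):
    """Single-linkage distance between two index sets: smallest pairwise distance."""
    return min(dist[i][j] for i in a for j in b)


def single_linkage(dist, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: set_dist(dist, clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] |= clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


def bipartitions(members):
    """Yield each split of a cluster into two non-empty parts exactly once."""
    first, rest = members[0], members[1:]
    for r in range(len(rest)):  # r < len(rest) keeps the second part non-empty
        for extra in combinations(rest, r):
            part = {first, *extra}
            yield part, set(members) - part


# Hypothetical toy data: each number stands in for one distribution, and the
# absolute difference stands in for the distance between two distributions.
positions = [0, 2, 4, 6, 11, 13]
dist = [[abs(p - q) for q in positions] for p in positions]
truth = [[0, 1, 2, 3], [4, 5]]  # assumed true clusters (indices into positions)

# d_L: largest distance between two members of the same cluster.
d_L = max(dist[i][j] for c in truth for i, j in combinations(c, 2))
# d_H: smallest distance between members of different clusters.
d_H = min(set_dist(dist, a, b) for a, b in combinations(truth, 2))
# d_I: largest sub-cluster distance over all two-way partitions of a cluster
# (sub-cluster distance taken as set_dist, an assumption of this sketch).
d_I = max(set_dist(dist, p, q) for c in truth for p, q in bipartitions(c))

print(d_L, d_H, d_I)                     # 6 5 2
print(single_linkage(dist, 2) == truth)  # True
```

In this toy chain-like cluster, $d_{L} = 6$ exceeds $d_{H} = 5$, so the earlier assumption $d_{L} < d_{H}$ fails, while $d_{I} = 2 < d_{H}$ holds and single linkage still recovers the two clusters.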
About the journal:
The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term “signal” includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.