{"title":"Clustering Longitudinal Data for Growth Curve Modelling by Gibbs Sampler and Information Criterion","authors":"Yu Fei, Rongli Li, Zhouhong Li, Guoqi Qian","doi":"10.1007/s00357-024-09477-z","DOIUrl":"https://doi.org/10.1007/s00357-024-09477-z","url":null,"abstract":"<p>Clustering longitudinal data for growth curve modelling is considered in this paper, where we aim to optimally estimate the underpinning unknown group partition matrix. Instead of following the conventional soft clustering approach, which assumes the columns of the partition matrix to have i.i.d. multinomial or categorical prior distributions and uses a regression model with the response following a finite mixture distribution to estimate the posterior distribution of the partition matrix, we propose an iterative partition and regression procedure to find the best partition matrix and the associated best growth curve regression model for each identified cluster. We show that the best partition matrix is the one minimizing a recently developed empirical Bayes information criterion (eBIC), which, due to the involved combinatorial explosion, is difficult to compute via enumerating all candidate partition matrices. Thus, we develop a Gibbs sampling method to generate a Markov chain of candidate partition matrices that has its equilibrium probability distribution equal the one induced from eBIC. We further show that the best partition matrix, given a priori the number of latent clusters, can be consistently estimated and is computationally scalable based on this Markov chain. The number of latent clusters is also best estimated by minimizing eBIC. The proposed iterative clustering and regression method is assessed by a comprehensive simulation study before being applied to two real-world growth curve modelling examples involving longitudinal data clustering.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"199 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141510498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding Outliers in Gaussian Model-based Clustering","authors":"Katharine M. Clark, Paul D. McNicholas","doi":"10.1007/s00357-024-09473-3","DOIUrl":"https://doi.org/10.1007/s00357-024-09473-3","url":null,"abstract":"<p>Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and <i>post hoc</i> outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"13 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141188966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Classification Algorithm Based on the Synergy Between Dynamic Clustering with Adaptive Distances and K-Nearest Neighbors","authors":"Mohammed Sabri, Rosanna Verde, Antonio Balzanella, Fabrizio Maturo, Hamid Tairi, Ali Yahyaouy, Jamal Riffi","doi":"10.1007/s00357-024-09471-5","DOIUrl":"https://doi.org/10.1007/s00357-024-09471-5","url":null,"abstract":"<p>This paper introduces a novel supervised classification method based on dynamic clustering (DC) and K-nearest neighbor (KNN) learning algorithms, denoted DC-KNN. The aim is to improve the accuracy of a classifier by using a DC method to discover the hidden patterns of the apriori groups of the training set. It provides a partitioning of each group into a predetermined number of subgroups. A new objective function is designed for the DC variant, based on a trade-off between the compactness and separation of all subgroups in the original groups. Moreover, the proposed DC method uses adaptive distances which assign a set of weights to the variables of each cluster, which depend on both their intra-cluster and inter-cluster structure. DC-KNN performs the minimization of a suitable objective function. Next, the KNN algorithm takes into account objects by assigning them to the label of subgroups. Furthermore, the classification step is performed according to two KNN competing algorithms. The proposed strategies have been evaluated using both synthetic data and widely used real datasets from public repositories. The achieved results have confirmed the effectiveness and robustness of the strategy in improving classification accuracy in comparison to alternative approaches.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"94 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerated Sequential Data Clustering","authors":"Reza Mortazavi, Elham Enayati, Abdolali Basiri","doi":"10.1007/s00357-024-09472-4","DOIUrl":"https://doi.org/10.1007/s00357-024-09472-4","url":null,"abstract":"<p>Data clustering is an important task in the field of data mining. In many real applications, clustering algorithms must consider the order of data, resulting in the problem of clustering sequential data. For instance, analyzing the moving pattern of an object and detecting community structure in a complex network are related to sequential data clustering. The constraint of the continuous region prevents previous clustering algorithms from being directly applied to the problem. A dynamic programming algorithm was proposed to address the issue, which returns the optimal sequential data clustering. However, it is not scalable and hence the practicality is limited. This paper revisits the solution and enhances it by introducing a greedy stopping condition. This condition halts the algorithm’s search process when it is likely that the optimal solution has been found. Experimental results on multiple datasets show that the algorithm is much faster than its original solution while the optimality gap is negligible.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"25 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skew Multiple Scaled Mixtures of Normal Distributions with Flexible Tail Behavior and Their Application to Clustering","authors":"Abbas Mahdavi, Anthony F. Desmond, Ahad Jamalizadeh, Tsung-I Lin","doi":"10.1007/s00357-024-09470-6","DOIUrl":"https://doi.org/10.1007/s00357-024-09470-6","url":null,"abstract":"<p>The family of multiple scaled mixtures of multivariate normal (MSMN) distributions has been shown to be a powerful tool for modeling data that allow different marginal amounts of tail weight. An extension of the MSMN distribution is proposed through the incorporation of a vector of shape parameters, resulting in the skew multiple scaled mixtures of multivariate normal (SMSMN) distributions. The family of SMSMN distributions can express a variety of shapes by controlling different degrees of tailedness and versatile skewness in each dimension. Some characterizations and probabilistic properties of the SMSMN distributions are studied and an extension to finite mixtures thereof is also discussed. Based on a sort of selection mechanism, a feasible ECME algorithm is designed to compute the maximum likelihood estimates of model parameters. Numerical experiments on simulated data and three real data examples demonstrate the efficacy and usefulness of the proposed methodology.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"111 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multinomial Restricted Unfolding","authors":"Mark de Rooij, Frank Busing","doi":"10.1007/s00357-024-09465-3","DOIUrl":"https://doi.org/10.1007/s00357-024-09465-3","url":null,"abstract":"<p>For supervised classification we propose to use restricted multidimensional unfolding in a multinomial logistic framework. Where previous research proposed similar models based on squared distances, we propose to use usual (i.e., not squared) Euclidean distances. This change in functional form results in several interpretational advantages of the resulting biplot, a graphical representation of the classification model. First, the conditional probability of any class peaks at the location of the class in the Euclidean space. Second, the interpretation of the biplot is in terms of distances towards the class points, whereas in the squared distance model the interpretation is in terms of the distance towards the decision boundary. Third, the distance between two class points represents an upper bound for the estimated log-odds of choosing one of these classes over the other. For our multinomial restricted unfolding, we develop and test a Majorization Minimization algorithm that monotonically decreases the negative log-likelihood. With two empirical applications we point out the advantages of the distance model and show how to apply multinomial restricted unfolding in practice, including model selection.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inferential Tools for Assessing Dependence Across Response Categories in Multinomial Models with Discrete Random Effects","authors":"","doi":"10.1007/s00357-024-09466-2","DOIUrl":"https://doi.org/10.1007/s00357-024-09466-2","url":null,"abstract":"<h3>Abstract</h3> <p>We propose a discrete random effects multinomial regression model to deal with estimation and inference issues in the case of categorical and hierarchical data. Random effects are assumed to follow a discrete distribution with an a priori unknown number of support points. For a <em>K</em>-categories response, the modelling identifies a latent structure at the highest level of grouping, where groups are clustered into subpopulations. This model does not assume the independence across random effects relative to different response categories, and this provides an improvement from the multinomial semi-parametric multilevel model previously proposed in the literature. Since the category-specific random effects arise from the same subjects, the independence assumption is seldom verified in real data. To evaluate the improvements provided by the proposed model, we reproduce simulation and case studies of the literature, highlighting the strength of the method in properly modelling the real data structure and the advantages that taking into account the data dependence structure offers.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"62 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140034076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Binary Peacock Algorithm: A Novel Metaheuristic Approach for Feature Selection","authors":"Hema Banati, Richa Sharma, Asha Yadav","doi":"10.1007/s00357-024-09468-0","DOIUrl":"https://doi.org/10.1007/s00357-024-09468-0","url":null,"abstract":"<p>Binary metaheuristic algorithms prove to be invaluable for solving binary optimization problems. This paper proposes a binary variant of the peacock algorithm (PA) for feature selection. PA, a recent metaheuristic algorithm, is built upon lekking and mating behaviors of peacocks and peahens. While designing the binary variant, two major shortcomings of PA (lek formation and offspring generation) were identified and addressed. Eight binary variants of PA are also proposed and compared over mean fitness to identify the best variant, called binary peacock algorithm (bPA). To validate bPA’s performance experiments are conducted using 34 benchmark datasets and results are compared with eight well-known binary metaheuristic algorithms. The results show that bPA classifies 30 datasets with highest accuracy and extracts minimum features in 32 datasets, achieving up to 99.80% reduction in the feature subset size in the dataset with maximum features. bPA attained rank 1 in Friedman rank test over all parameters.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"11 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140034222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data","authors":"Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar","doi":"10.1007/s00357-024-09463-5","DOIUrl":"https://doi.org/10.1007/s00357-024-09463-5","url":null,"abstract":"<p>This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling and an extension of principal component analysis (PCA) incorporating conditional class moment estimates in the low-dimensional projection. Block partitioning allows us to handle the high correlation of our data by grouping them into blocks where the correlation within the same block is maximized and the correlation between variables in different blocks is minimized. The extended PCA allows us to perform low-dimensional projection and clustering supervised. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performances. SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, the study of the contribution of genetics in infectious disease phenotypes is crucial. The classical statistical models currently used in the field of genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in the study of complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular, the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the determining SNPs in malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach has significantly high accuracy in predicting malaria episodes.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"6 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140011375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Soft Label Guided Unsupervised Discriminative Sparse Subspace Feature Selection","authors":"Keding Chen, Yong Peng, Feiping Nie, Wanzeng Kong","doi":"10.1007/s00357-024-09462-6","DOIUrl":"https://doi.org/10.1007/s00357-024-09462-6","url":null,"abstract":"<p>Feature selection and subspace learning are two primary methods to achieve data dimensionality reduction and discriminability enhancement. However, data label information is unavailable in unsupervised learning to guide the dimensionality reduction process. To this end, we propose a soft label guided unsupervised discriminative sparse subspace feature selection (UDS<span>(^2)</span>FS) model in this paper, which consists of two superiorities in comparison with the existing studies. On the one hand, UDS<span>(^2)</span>FS aims to find a discriminative subspace to simultaneously maximize the between-class data scatter and minimize the within-class scatter. On the other hand, UDS<span>(^2)</span>FS estimates the data label information in the learned subspace, which further serves as the soft labels to guide the discriminative subspace learning process. Moreover, the <span>(ell _{2,0})</span>-norm is imposed to achieve row sparsity of the subspace projection matrix, which is parameter-free and more stable compared to the <span>(ell _{2,1})</span>-norm. Experimental studies to evaluate the performance of UDS<span>(^2)</span>FS are performed from three aspects, i.e., a synthetic data set to check its iterative optimization process, several toy data sets to visualize the feature selection effect, and some benchmark data sets to examine the clustering performance of UDS<span>(^2)</span>FS. From the obtained results, UDS<span>(^2)</span>FS exhibits competitive performance in joint subspace learning and feature selection in comparison with some related models.</p>","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"330 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139559325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}