Journal of Classification: Latest Articles

Clustering Longitudinal Data for Growth Curve Modelling by Gibbs Sampler and Information Criterion
IF 2.0 · CAS Quartile 4 · Computer Science
Journal of Classification · Pub Date: 2024-06-19 · DOI: 10.1007/s00357-024-09477-z
Yu Fei, Rongli Li, Zhouhong Li, Guoqi Qian
Clustering longitudinal data for growth curve modelling is considered in this paper, where we aim to optimally estimate the underpinning unknown group partition matrix. Instead of following the conventional soft clustering approach, which assumes the columns of the partition matrix to have i.i.d. multinomial or categorical prior distributions and uses a regression model with the response following a finite mixture distribution to estimate the posterior distribution of the partition matrix, we propose an iterative partition and regression procedure to find the best partition matrix and the associated best growth curve regression model for each identified cluster. We show that the best partition matrix is the one minimizing a recently developed empirical Bayes information criterion (eBIC), which, due to the involved combinatorial explosion, is difficult to compute by enumerating all candidate partition matrices. We therefore develop a Gibbs sampling method to generate a Markov chain of candidate partition matrices whose equilibrium probability distribution equals the one induced from eBIC. We further show that the best partition matrix, given a priori the number of latent clusters, can be consistently estimated and is computationally scalable based on this Markov chain. The number of latent clusters is also best estimated by minimizing eBIC. The proposed iterative clustering and regression method is assessed by a comprehensive simulation study before being applied to two real-world growth curve modelling examples involving longitudinal data clustering.
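The core computational idea, resampling one observation's cluster label at a time so that the chain favors partitions with a lower information criterion, can be sketched as follows. This is an illustrative simplification, not the authors' implementation: a within-cluster sum of squares plus a crude complexity penalty stands in for eBIC, and the growth curve regression step is omitted.

```python
import numpy as np

def criterion(X, labels, k):
    """Stand-in for eBIC: total within-cluster sum of squares
    plus a crude complexity penalty (illustrative only)."""
    cost = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts) > 0:
            cost += ((pts - pts.mean(axis=0)) ** 2).sum()
    return cost + k * np.log(len(X))

def gibbs_partition(X, k, n_sweeps=50, temp=1.0, seed=0):
    """Gibbs-style sweep: resample each label from probabilities
    proportional to exp(-criterion / temp), tracking the best
    (lowest-criterion) partition visited by the chain."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = rng.integers(0, k, size=n)
    best = labels.copy()
    best_val = criterion(X, labels, k)
    for _ in range(n_sweeps):
        for i in range(n):
            # Evaluate the criterion for every candidate label of point i.
            vals = np.empty(k)
            for c in range(k):
                labels[i] = c
                vals[c] = criterion(X, labels, k)
            p = np.exp(-(vals - vals.min()) / temp)
            labels[i] = rng.choice(k, p=p / p.sum())
            cur = criterion(X, labels, k)
            if cur < best_val:
                best_val, best = cur, labels.copy()
    return best, best_val
```

On well-separated data the chain quickly concentrates on the partition separating the groups, which the best-so-far snapshot then retains.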
Citations: 0
Finding Outliers in Gaussian Model-based Clustering
Journal of Classification · Pub Date: 2024-05-30 · DOI: 10.1007/s00357-024-09473-3
Katharine M. Clark, Paul D. McNicholas
Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that the sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
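The trimming loop described above can be sketched for a single Gaussian cluster. This is a simplification of OCLUST: a fixed squared Mahalanobis distance cutoff (`cutoff`, a hypothetical parameter) replaces the paper's beta-derived reference distribution for subset log-likelihoods, and the mixture model is reduced to one component.

```python
import numpy as np

def trim_outliers(X, cutoff=9.0):
    """Iteratively remove the point with the largest squared Mahalanobis
    distance until all remaining points fall below `cutoff`.
    A simplified stand-in for OCLUST's subset log-likelihood test."""
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    while keep.sum() > X.shape[1] + 1:
        pts = X[keep]
        mu = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False)
        inv = np.linalg.inv(np.atleast_2d(cov))
        diff = X - mu
        # Squared Mahalanobis distance of every point to the current fit.
        d2 = np.einsum('ij,jk,ik->i', diff, inv, diff)
        d2[~keep] = -np.inf          # ignore already-removed points
        worst = np.argmax(d2)
        if d2[worst] <= cutoff:
            break                     # remaining points conform
        keep[worst] = False           # trim the least plausible point
    return keep
```

Like OCLUST, the loop needs no pre-specified number of outliers; it stops when the retained subset conforms to the reference cutoff.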
Citations: 0
A Novel Classification Algorithm Based on the Synergy Between Dynamic Clustering with Adaptive Distances and K-Nearest Neighbors
Journal of Classification · Pub Date: 2024-05-11 · DOI: 10.1007/s00357-024-09471-5
Mohammed Sabri, Rosanna Verde, Antonio Balzanella, Fabrizio Maturo, Hamid Tairi, Ali Yahyaouy, Jamal Riffi
This paper introduces a novel supervised classification method based on dynamic clustering (DC) and K-nearest neighbor (KNN) learning algorithms, denoted DC-KNN. The aim is to improve the accuracy of a classifier by using a DC method to discover the hidden patterns of the a priori groups of the training set. It provides a partitioning of each group into a predetermined number of subgroups. A new objective function is designed for the DC variant, based on a trade-off between the compactness and separation of all subgroups in the original groups. Moreover, the proposed DC method uses adaptive distances that assign a set of weights to the variables of each cluster, depending on both their intra-cluster and inter-cluster structure. DC-KNN minimizes a suitable objective function. Next, the KNN algorithm assigns new objects to the labels of the identified subgroups. Furthermore, the classification step is performed according to two competing KNN algorithms. The proposed strategies have been evaluated using both synthetic data and widely used real datasets from public repositories. The achieved results confirm the effectiveness and robustness of the strategy in improving classification accuracy in comparison with alternative approaches.
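The two-stage structure, splitting each a priori class into subgroups and then classifying by nearest subgroup, can be sketched as below. Plain k-means stands in for the paper's dynamic clustering with adaptive distances, and 1-NN against subgroup centroids stands in for the two competing KNN schemes; all function names are illustrative.

```python
import numpy as np

def fit_subgroups(X, y, n_sub=2, n_iter=20, seed=0):
    """Split each a priori class into `n_sub` subgroups with plain k-means
    (a stand-in for dynamic clustering with adaptive distances), returning
    subgroup centroids and their parent-class labels."""
    rng = np.random.default_rng(seed)
    centroids, parents = [], []
    for cls in np.unique(y):
        pts = X[y == cls]
        centers = pts[rng.choice(len(pts), n_sub, replace=False)]
        for _ in range(n_iter):
            d = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(1)
            for c in range(n_sub):
                if (assign == c).any():
                    centers[c] = pts[assign == c].mean(0)
        centroids.append(centers)
        parents.extend([cls] * n_sub)
    return np.vstack(centroids), np.array(parents)

def predict(X_new, centroids, parents):
    """1-NN against subgroup centroids; label = parent class of nearest."""
    d = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return parents[d.argmin(1)]
```

The point of the subgroup layer is visible when a class occupies several disjoint regions: each region gets its own centroid, so nearest-centroid classification stays accurate where a single class prototype would fail.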
Citations: 0
Accelerated Sequential Data Clustering
Journal of Classification · Pub Date: 2024-05-09 · DOI: 10.1007/s00357-024-09472-4
Reza Mortazavi, Elham Enayati, Abdolali Basiri
Data clustering is an important task in the field of data mining. In many real applications, clustering algorithms must consider the order of the data, resulting in the problem of clustering sequential data. For instance, analyzing the movement pattern of an object and detecting community structure in a complex network are related to sequential data clustering. The constraint of contiguous regions prevents previous clustering algorithms from being directly applied to the problem. A dynamic programming algorithm was proposed to address the issue, which returns the optimal sequential data clustering. However, it is not scalable, and hence its practicality is limited. This paper revisits the solution and enhances it by introducing a greedy stopping condition, which halts the algorithm's search process when the optimal solution has likely been found. Experimental results on multiple datasets show that the algorithm is much faster than the original solution while the optimality gap is negligible.
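The baseline that the paper accelerates, an exact dynamic program over contiguous segments, can be sketched as follows. The greedy stopping condition itself is not reproduced; the segment cost here is the within-segment sum of squared deviations, computed in O(1) from prefix sums.

```python
import numpy as np

def segment_cost(prefix, prefix_sq, i, j):
    """Within-segment sum of squared deviations for x[i:j],
    computed in O(1) from prefix sums of x and x**2."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n

def sequential_clustering(x, k):
    """Exact dynamic program for clustering a sequence into k contiguous
    segments (the baseline the paper accelerates).
    Returns (total cost, list of k-1 cut positions)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    prefix = np.concatenate([[0.0], np.cumsum(x)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(x * x)])
    INF = float('inf')
    # dp[c, j] = best cost of splitting x[:j] into c segments.
    dp = np.full((k + 1, n + 1), INF)
    arg = np.zeros((k + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                val = dp[c - 1, i] + segment_cost(prefix, prefix_sq, i, j)
                if val < dp[c, j]:
                    dp[c, j] = val
                    arg[c, j] = i
    # Recover segment boundaries by backtracking.
    cuts, j = [], n
    for c in range(k, 0, -1):
        i = arg[c, j]
        cuts.append(i)
        j = i
    return dp[k, n], sorted(cuts)[1:]  # drop the leading 0
```

The triple loop is O(k n²), which is exactly the scalability issue the abstract mentions; the paper's greedy stopping condition prunes this search while keeping the optimality gap negligible.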
Citations: 0
Skew Multiple Scaled Mixtures of Normal Distributions with Flexible Tail Behavior and Their Application to Clustering
Journal of Classification · Pub Date: 2024-05-06 · DOI: 10.1007/s00357-024-09470-6
Abbas Mahdavi, Anthony F. Desmond, Ahad Jamalizadeh, Tsung-I Lin
The family of multiple scaled mixtures of multivariate normal (MSMN) distributions has been shown to be a powerful tool for modeling data that allow different marginal amounts of tail weight. An extension of the MSMN distribution is proposed through the incorporation of a vector of shape parameters, resulting in the skew multiple scaled mixtures of multivariate normal (SMSMN) distributions. The family of SMSMN distributions can express a variety of shapes by controlling different degrees of tailedness and versatile skewness in each dimension. Some characterizations and probabilistic properties of the SMSMN distributions are studied, and an extension to finite mixtures thereof is also discussed. Based on a selection mechanism, a feasible ECME algorithm is designed to compute the maximum likelihood estimates of the model parameters. Numerical experiments on simulated data and three real data examples demonstrate the efficacy and usefulness of the proposed methodology.
Citations: 0
Multinomial Restricted Unfolding
Journal of Classification · Pub Date: 2024-04-08 · DOI: 10.1007/s00357-024-09465-3
Mark de Rooij, Frank Busing
For supervised classification we propose to use restricted multidimensional unfolding in a multinomial logistic framework. Where previous research proposed similar models based on squared distances, we propose to use the usual (i.e., not squared) Euclidean distances. This change in functional form results in several interpretational advantages for the resulting biplot, a graphical representation of the classification model. First, the conditional probability of any class peaks at the location of the class point in the Euclidean space. Second, the interpretation of the biplot is in terms of distances towards the class points, whereas in the squared-distance model the interpretation is in terms of distances towards the decision boundary. Third, the distance between two class points represents an upper bound for the estimated log-odds of choosing one of these classes over the other. For our multinomial restricted unfolding, we develop and test a Majorization Minimization algorithm that monotonically decreases the negative log-likelihood. With two empirical applications we point out the advantages of the distance model and show how to apply multinomial restricted unfolding in practice, including model selection.
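The first and third interpretational properties follow directly from a softmax over negative distances. A minimal sketch, assuming class probabilities proportional to exp(-d) with plain (not squared) Euclidean distances to the class points, as the abstract describes; the restricted-unfolding estimation itself is not shown.

```python
import numpy as np

def class_probabilities(u, V):
    """Distance model from the abstract: P(class g | point u) is
    proportional to exp(-d(u, v_g)), with plain Euclidean distances
    d to the class points v_g (rows of V) in the unfolding space."""
    d = np.linalg.norm(V - u, axis=1)
    w = np.exp(-d)
    return w / w.sum()
```

Two properties are easy to verify numerically: a class's probability peaks at its own class point, and by the triangle inequality the log-odds log(p_g/p_h) = d_h - d_g never exceeds the distance between the two class points.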
Citations: 0
Inferential Tools for Assessing Dependence Across Response Categories in Multinomial Models with Discrete Random Effects
Journal of Classification · Pub Date: 2024-03-04 · DOI: 10.1007/s00357-024-09466-2
We propose a discrete random effects multinomial regression model to deal with estimation and inference issues in the case of categorical and hierarchical data. Random effects are assumed to follow a discrete distribution with an a priori unknown number of support points. For a K-category response, the modelling identifies a latent structure at the highest level of grouping, where groups are clustered into subpopulations. This model does not assume independence across random effects relative to different response categories, which provides an improvement over the multinomial semi-parametric multilevel model previously proposed in the literature. Since the category-specific random effects arise from the same subjects, the independence assumption is seldom verified in real data. To evaluate the improvements provided by the proposed model, we reproduce simulation and case studies from the literature, highlighting the strength of the method in properly modelling the real data structure and the advantages of taking the data dependence structure into account.
Citations: 0
Binary Peacock Algorithm: A Novel Metaheuristic Approach for Feature Selection
Journal of Classification · Pub Date: 2024-03-04 · DOI: 10.1007/s00357-024-09468-0
Hema Banati, Richa Sharma, Asha Yadav
Binary metaheuristic algorithms prove invaluable for solving binary optimization problems. This paper proposes a binary variant of the peacock algorithm (PA) for feature selection. PA, a recent metaheuristic algorithm, is built upon the lekking and mating behaviors of peacocks and peahens. While designing the binary variant, two major shortcomings of PA (lek formation and offspring generation) were identified and addressed. Eight binary variants of PA are proposed and compared on mean fitness to identify the best variant, called the binary peacock algorithm (bPA). To validate bPA's performance, experiments are conducted using 34 benchmark datasets, and results are compared with eight well-known binary metaheuristic algorithms. The results show that bPA classifies 30 datasets with the highest accuracy and extracts the minimum number of features in 32 datasets, achieving up to a 99.80% reduction in feature subset size on the dataset with the most features. bPA attained rank 1 in the Friedman rank test over all parameters.
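The wrapper fitness that binary feature-selection metaheuristics of this kind typically optimize, a weighted sum of classification error and subset size, can be sketched with a generic single-solution bit-flip search. This is not the peacock algorithm's lekking/mating scheme; the search operator, the 1-NN error estimate, and the `alpha` weighting are illustrative assumptions.

```python
import numpy as np

def fitness(mask, X, y, alpha=0.99):
    """Common wrapper fitness in binary feature selection:
    alpha * error + (1 - alpha) * (#selected / #features),
    with error taken as leave-one-out 1-NN error on the selected features."""
    if not mask.any():
        return 1.0
    Xs = X[:, mask]
    d = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    err = (y[d.argmin(1)] != y).mean()
    return alpha * err + (1 - alpha) * mask.mean()

def binary_search_features(X, y, n_iter=200, seed=0):
    """Generic single-solution binary metaheuristic (random bit flips with
    greedy acceptance), a stand-in for bPA's population-based moves."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape[1]) < 0.5
    best, best_fit = mask.copy(), fitness(mask, X, y)
    for _ in range(n_iter):
        cand = best.copy()
        j = rng.integers(X.shape[1])
        cand[j] = ~cand[j]               # flip one feature bit
        f = fitness(cand, X, y)
        if f < best_fit:
            best, best_fit = cand, f
    return best, best_fit
```

The small (1 - alpha) term implements the abstract's dual objective: among masks with equal accuracy, the search prefers the smaller feature subset.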
Citations: 0
Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data
Journal of Classification · Pub Date: 2024-02-28 · DOI: 10.1007/s00357-024-09463-5
Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar
This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling and an extension of principal component analysis (PCA) incorporating conditional class moment estimates in the low-dimensional projection. Block partitioning allows us to handle the high correlation of our data by grouping variables into blocks where the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA allows us to perform supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performance. SNPs are a type of genetic variation representing a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in the field of genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in the study of complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the determining SNPs in malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach has significantly high accuracy in predicting malaria episodes.
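The idea of injecting conditional class moments into a low-dimensional projection can be illustrated, for two groups, by Fisher's LDA direction, which the abstract's prediction results also reference. This is a minimal sketch under that simplification, not the paper's extended PCA or its block-partitioning step.

```python
import numpy as np

def class_mean_projection(X, y):
    """Build a 1-D discriminative projection from conditional class means:
    the difference of the two class means, whitened by the pooled
    within-class covariance (Fisher's LDA direction for two groups)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(0), X1.mean(0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(np.atleast_2d(Sw), m1 - m0)
    return w / np.linalg.norm(w)

def predict_lda(X, y, X_new):
    """Classify new points by the nearest projected class mean
    along the discriminative direction."""
    w = class_mean_projection(X, y)
    m0 = X[y == 0].mean(0) @ w
    m1 = X[y == 1].mean(0) @ w
    z = X_new @ w
    return (np.abs(z - m1) < np.abs(z - m0)).astype(int)
```

Unlike plain PCA, which keeps directions of maximal total variance regardless of labels, this projection is driven by the class-conditional moments, which is the supervision the abstract's extended PCA adds.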
Citations: 0
Soft Label Guided Unsupervised Discriminative Sparse Subspace Feature Selection
Journal of Classification · Pub Date: 2024-01-25 · DOI: 10.1007/s00357-024-09462-6
Keding Chen, Yong Peng, Feiping Nie, Wanzeng Kong
Feature selection and subspace learning are two primary methods for achieving data dimensionality reduction and discriminability enhancement. However, in unsupervised learning, data label information is unavailable to guide the dimensionality reduction process. To this end, we propose a soft label guided unsupervised discriminative sparse subspace feature selection (UDS²FS) model, which has two advantages over existing studies. On the one hand, UDS²FS aims to find a discriminative subspace that simultaneously maximizes the between-class data scatter and minimizes the within-class scatter. On the other hand, UDS²FS estimates the data label information in the learned subspace, which further serves as the soft labels to guide the discriminative subspace learning process. Moreover, the ℓ2,0-norm is imposed to achieve row sparsity of the subspace projection matrix, which is parameter-free and more stable than the ℓ2,1-norm. The performance of UDS²FS is evaluated from three aspects: a synthetic data set to check its iterative optimization process, several toy data sets to visualize the feature selection effect, and some benchmark data sets to examine its clustering performance. The obtained results show that UDS²FS exhibits competitive performance in joint subspace learning and feature selection in comparison with related models.
Citations: 0