Dingge Liang, Marco Corneli, Charles Bouveyron, Pierre Latouche
{"title":"Clustering by deep latent position model with graph convolutional network","authors":"Dingge Liang, Marco Corneli, Charles Bouveyron, Pierre Latouche","doi":"10.1007/s11634-024-00583-9","DOIUrl":"https://doi.org/10.1007/s11634-024-00583-9","url":null,"abstract":"<p>With the significant increase of interactions between individuals through digital means, clustering of nodes in graphs has become a fundamental approach for analyzing large and complex networks. In this work, we propose the deep latent position model (DeepLPM), an end-to-end generative clustering approach which combines the widely used latent position model (LPM) for network analysis with a graph convolutional network encoding strategy. Moreover, an original estimation algorithm is introduced to integrate the explicit optimization of the posterior clustering probabilities via variational inference and the implicit optimization using stochastic gradient descent for graph reconstruction. Numerical experiments on simulated scenarios highlight the ability of DeepLPM to self-penalize the evidence lower bound for selecting the number of clusters, demonstrating its clustering capabilities compared to state-of-the-art methods. Finally, DeepLPM is further applied to an ecclesiastical network in Merovingian Gaul and to the Cora citation network to illustrate the practical interest in exploring large and complex real-world networks.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"35 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu
{"title":"Choosing the number of factors in factor analysis with incomplete data via a novel hierarchical Bayesian information criterion","authors":"Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu","doi":"10.1007/s11634-024-00582-w","DOIUrl":"https://doi.org/10.1007/s11634-024-00582-w","url":null,"abstract":"<p>The Bayesian information criterion (BIC), defined as the observed data log likelihood minus a penalty term based on the sample size <i>N</i>, is a popular model selection criterion for factor analysis with complete data. This definition has also been suggested for incomplete data. However, the penalty term based on the ‘complete’ sample size <i>N</i> is the same whether the data are complete or incomplete. For incomplete data, there are often only <span>(N_i<N)</span> observations for variable <i>i</i>, which means that using the ‘complete’ sample size <i>N</i> implausibly ignores the amounts of missing information inherent in incomplete data. Given this observation, a novel hierarchical BIC (HBIC) criterion is proposed for factor analysis with incomplete data, which is denoted by HBIC<sub>inc</sub>. The novelty is that HBIC<sub>inc</sub> only uses the actual amounts of observed information, namely <span>(N_i)</span>’s, in the penalty term. Theoretically, it is shown that HBIC<sub>inc</sub> is a large sample approximation of the variational Bayesian (VB) lower bound, and BIC is a further approximation of HBIC<sub>inc</sub>, which means that HBIC<sub>inc</sub> shares the theoretical consistency of BIC. Experiments on synthetic and real data sets are conducted to assess the finite sample performance of HBIC<sub>inc</sub>, BIC, and related criteria with various missing rates. The results show that HBIC<sub>inc</sub> and BIC perform similarly when the missing rate is small, but HBIC<sub>inc</sub> is more accurate when the missing rate is not small.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"92 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140057185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimators of various kappa coefficients based on the unbiased estimator of the expected index of agreements","authors":"A. Martín Andrés, M. Álvarez Hernández","doi":"10.1007/s11634-024-00581-x","DOIUrl":"https://doi.org/10.1007/s11634-024-00581-x","url":null,"abstract":"<p>To measure the degree of agreement between <i>R</i> observers who independently classify <i>n</i> subjects within <i>K</i> categories, various <i>kappa</i>-type coefficients are often used. When <i>R</i> = 2, it is common to use the Cohen's <i>kappa</i>, Scott's <i>pi</i>, Gwet’s <i>AC1/2</i>, and Krippendorff's <i>alpha</i> coefficients (weighted or not). When <i>R</i> > 2, some pairwise version based on the aforementioned coefficients is normally used; with the same order as above: Hubert's <i>kappa</i>, Fleiss's <i>kappa</i>, Gwet's <i>AC1/2,</i> and Krippendorff's <i>alpha</i>. However, all these statistics are based on biased estimators of the expected index of agreements, since they estimate the product of two population proportions through the product of their sample estimators. This article has three aims. First, to provide statistics based on unbiased estimators of the expected index of agreements and determine their variance based on the variance of the original statistic. Second, to make pairwise extensions of some measures. And third, to show that the old and new estimators of the Cohen’s <i>kappa</i> and Hubert’s <i>kappa</i> coefficients match the well-known estimators of concordance and intraclass correlation coefficients, if the former are defined by assuming quadratic weights. The article shows that the new estimators are always greater than or equal to the classic ones, except in the case of Gwet, where it is the other way around, although these differences are only relevant with small sample sizes (e.g. <i>n</i> ≤ 30).</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"57 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis-Angel García-Escudero, Salvatore Ingrassia, T. Brendan Murphy
{"title":"Special issue on “advances in models and learning for clustering and classification”","authors":"Luis-Angel García-Escudero, Salvatore Ingrassia, T. Brendan Murphy","doi":"10.1007/s11634-024-00584-8","DOIUrl":"10.1007/s11634-024-00584-8","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 1","pages":"1 - 4"},"PeriodicalIF":1.4,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142414305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial quantile clustering of climate data","authors":"Carlo Gaetan, Paolo Girardi, Victor Muthama Musau","doi":"10.1007/s11634-024-00580-y","DOIUrl":"https://doi.org/10.1007/s11634-024-00580-y","url":null,"abstract":"<p>In the era of climate change, the distribution of climate variables evolves with changes not limited to the mean value. Consequently, clustering algorithms based on central tendency could produce misleading results when used to summarize spatial and/or temporal patterns. We present a novel approach to spatial clustering of time series based on quantiles using a Bayesian framework that incorporates a spatial dependence layer based on a Markov random field. A series of simulations tested the proposal, which was then applied to the sea surface temperature of the Mediterranean Sea, one of the first seas to be affected by climate change.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"198 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139946363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Berkay Akturk, Ufuk Beyaztas, Han Lin Shang, Abhijit Mandal
{"title":"Robust functional logistic regression","authors":"Berkay Akturk, Ufuk Beyaztas, Han Lin Shang, Abhijit Mandal","doi":"10.1007/s11634-023-00577-z","DOIUrl":"https://doi.org/10.1007/s11634-023-00577-z","url":null,"abstract":"<p>Functional logistic regression is a popular model to capture a linear relationship between binary response and functional predictor variables. However, many methods used for parameter estimation in functional logistic regression are sensitive to outliers, which may lead to inaccurate parameter estimates and inferior classification accuracy. We propose a robust estimation procedure for functional logistic regression, in which the observations of the functional predictor are projected onto a set of finite-dimensional subspaces via robust functional principal component analysis. This dimension-reduction step reduces the outlying effects in the functional predictor. The logistic regression coefficient is estimated using an M-type estimator based on binary response and robust principal component scores. In doing so, we provide robust estimates by minimizing the effects of outliers in the binary response and functional predictor variables. Via a series of Monte-Carlo simulations and using hand radiograph data, we examine the parameter estimation and classification accuracy for the response variable. We find that the robust procedure outperforms some existing robust and non-robust methods when outliers are present, while producing competitive results when outliers are absent. 
In addition, the proposed method is computationally more efficient than some existing robust alternatives.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"2018 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139771456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural networks with functional inputs for multi-class supervised classification of replicated point patterns","authors":"Kateřina Pawlasová, Iva Karafiátová, Jiří Dvořák","doi":"10.1007/s11634-024-00579-5","DOIUrl":"10.1007/s11634-024-00579-5","url":null,"abstract":"<div><p>A spatial point pattern is a collection of points observed in a bounded region of the Euclidean plane or space. With the dynamic development of modern imaging methods, large datasets of point patterns are available, representing, for example, sub-cellular location patterns for human proteins or large forest populations. The main goal of this paper is to show the possibility of solving the supervised multi-class classification task for this particular type of complex data via functional neural networks. To predict the class membership for a newly observed point pattern, we compute an empirical estimate of a selected functional characteristic. Then, we treat this estimated function as a functional variable entering the network. In a simulation study, we show that the neural network approach outperforms the kernel regression classifier that we consider a benchmark method in the point pattern setting. 
We also analyse a real dataset of point patterns of intramembranous particles and illustrate the practical applicability of the proposed method.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"705 - 721"},"PeriodicalIF":1.4,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-024-00579-5.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139771644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"k-means clustering for persistent homology","authors":"Yueqi Cao, Prudence Leung, Anthea Monod","doi":"10.1007/s11634-023-00578-y","DOIUrl":"https://doi.org/10.1007/s11634-023-00578-y","url":null,"abstract":"<p>Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram. It has recently gained much popularity from its myriad successful applications to many domains; however, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the <i>k</i>-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush–Kuhn–Tucker framework. Additionally, we perform numerical experiments on both simulated and real data of various representations of persistent homology, including embeddings of persistence diagrams as well as diagrams themselves and their generalizations as persistence measures. We find that <i>k</i>-means clustering applied directly to persistence diagrams and measures outperforms clustering on their vectorized representations.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"77 4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139644821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RGA: a unified measure of predictive accuracy","authors":"Paolo Giudici, Emanuela Raffinetti","doi":"10.1007/s11634-023-00574-2","DOIUrl":"https://doi.org/10.1007/s11634-023-00574-2","url":null,"abstract":"<p>A key step in assessing statistical forecasts is the evaluation of their predictive accuracy. Recently, a new measure, called Rank Graduation Accuracy (RGA), based on the concordance between the ranks of the predicted values and the ranks of the actual values of a series of observations to be forecast, was proposed for the assessment of the quality of the predictions. In this paper, we demonstrate that, from a classification perspective, when the response to be predicted is binary, the RGA coincides with both the AUROC and the Wilcoxon–Mann–Whitney statistic, and can be employed to evaluate the accuracy of probability forecasts. When the response to be predicted is real valued, the RGA can still be applied, unlike the AUROC, and similarly to measures such as the RMSE. Unlike the RMSE, the RGA measure evaluates point predictions in terms of their ranks, rather than in terms of their values, improving robustness.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139481072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QDA classification of high-dimensional data with rare and weak signals","authors":"Hanning Chen, Qiang Zhao, Jingjing Wu","doi":"10.1007/s11634-023-00576-0","DOIUrl":"https://doi.org/10.1007/s11634-023-00576-0","url":null,"abstract":"<p>This paper addresses the two-class classification problem for data with rare and weak signals, under the modern high-dimensional setup <span>(p>>n)</span>. Considering a two-component mixture of Gaussian features with different random mean vectors of rare and weak signals but a common covariance matrix (homoscedastic Gaussian), Fan (AS 41:2537-2571, 2013) investigated the optimality of linear discriminant analysis (LDA) and proposed an efficient variable selection and classification procedure. We extend their work to the more general scenario in which the two components have different random covariance matrices that differ by rare and weak signals, in order to assess the effect of differences in the covariance matrices on classification. Under this model, we investigated the behaviour of the quadratic discriminant analysis (QDA) classifier. On the theoretical side, we derived the successful and unsuccessful classification regions of QDA. For data with rare signals, variable selection will mostly improve the performance of statistical procedures. Thus, on the implementation side, we proposed a variable selection procedure for QDA based on Higher Criticism Thresholding (HCT), which has been proved efficient for LDA. 
In addition, we conducted extensive simulation studies to demonstrate the successful and unsuccessful classification regions of QDA and evaluate the effectiveness of the proposed HCT thresholded QDA.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"72 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138745929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}