{"title":"Linear covariance selection model via ℓ1-penalization","authors":"Kwan-Young Bak , Seongoh Park","doi":"10.1016/j.csda.2025.108176","DOIUrl":"10.1016/j.csda.2025.108176","url":null,"abstract":"<div><div>This paper presents a study on an <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-penalized covariance regression method. Conventional approaches in high-dimensional covariance estimation often lack the flexibility to integrate external information. As a remedy, we adopt the regression-based covariance modeling framework and introduce a linear covariance selection model (LCSM) to encompass a broader spectrum of covariance structures when covariate information is available. Unlike existing methods, we do not assume that the true covariance matrix can be exactly represented by a linear combination of known basis matrices. Instead, we adopt additional basis matrices for a portion of the covariance patterns not captured by the given bases. To estimate high-dimensional regression coefficients, we exploit the sparsity-inducing <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-penalization scheme. Our theoretical analyses are based on the (symmetric) matrix regression model with additive random error matrix, which allows us to establish new non-asymptotic convergence rates of the proposed covariance estimator. The proposed method is implemented with the coordinate descent algorithm. We conduct empirical evaluation on simulated data to complement theoretical findings and underscore the efficacy of our approach. To show a practical applicability of our method, we further apply it to the co-expression analysis of liver gene expression data where the given basis corresponds to the adjacency matrix of the co-expression network.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108176"},"PeriodicalIF":1.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deflation-adjusted Bayesian information criterion for selecting the number of clusters in K-means clustering","authors":"Masao Ueki","doi":"10.1016/j.csda.2025.108170","DOIUrl":"10.1016/j.csda.2025.108170","url":null,"abstract":"<div><div>A deflation-adjusted Bayesian information criterion is proposed by introducing a closed-form adjustment to the variance estimate after K-means clustering. An expected lower bound of the deflation in the variance estimate after K-means clustering is derived and used as an adjustment factor for the variance estimates. The deflation-adjusted variance estimates are applied to the Bayesian information criterion under the Gaussian model for selecting the number of clusters. The closed-form expression makes the proposed deflation-adjusted Bayesian information criterion computationally efficient. Simulation studies show that the deflation-adjusted Bayesian information criterion performs better than other existing clustering methods in some situations, including K-means clustering with the number of clusters selected by standard Bayesian information criteria, the gap statistic, the average silhouette score, the prediction strength, and clustering using a Gaussian mixture model with the Bayesian information criterion. The proposed method is illustrated through a real data application for clustering human genomic data from the 1000 Genomes Project.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108170"},"PeriodicalIF":1.5,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multiple imputation approach for flexible modelling of interval-censored data with missing and censored covariates","authors":"Yichen Lou , Yuqing Ma , Liming Xiang , Jianguo Sun","doi":"10.1016/j.csda.2025.108177","DOIUrl":"10.1016/j.csda.2025.108177","url":null,"abstract":"<div><div>This paper discusses regression analysis of interval-censored failure time data that commonly occur in biomedical studies among others. For the situation, the failure event of interest is known only to occur within an interval instead of being observed exactly. In addition to interval censoring on the failure time of interest, sometimes covariates may be missing or suffer censoring, which can bring extra theoretical and computational challenges for the regression analysis. To deal with such data, we propose a novel multiple imputation approach with the use of the rejection sampling under a class of semiparametric transformation models. The proposed method is flexible and can lead to more efficient estimation than the existing methods, and the resulting estimators are shown to be consistent and asymptotically normal. An extensive simulation study is conducted and demonstrates that the proposed approach works well in practice. Finally, we apply the proposed approach to a set of real data on Alzheimer's disease that motivated this study.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108177"},"PeriodicalIF":1.5,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143714600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse factor analysis for categorical data with the group-sparse generalized singular value decomposition","authors":"Ju-Chi Yu , Julie Le Borgne , Anjali Krishnan , Arnaud Gloaguen , Cheng-Ta Yang , Laura A. Rabin , Hervé Abdi , Vincent Guillemot","doi":"10.1016/j.csda.2025.108179","DOIUrl":"10.1016/j.csda.2025.108179","url":null,"abstract":"<div><div>Correspondence analysis, multiple correspondence analysis, and their discriminant counterparts (i.e., discriminant simple correspondence analysis and discriminant multiple correspondence analysis) are methods of choice for analyzing multivariate categorical data. In these methods, variables are integrated into optimal components computed as linear combinations whose weights are obtained from a generalized singular value decomposition (GSVD) that integrates specific metric constraints on the rows and columns of the original data matrix. The weights of the linear combinations are, in turn, used to interpret the components, and this interpretation is facilitated when components are 1) pairwise orthogonal and 2) when the values of the weights are either large or small but not intermediate—a configuration called a simple or a sparse structure. To obtain such simple configurations, the optimization problem solved by the GSVD is extended to include new constraints that implement component orthogonality and sparse weights. Because multiple correspondence analysis represents qualitative variables by a set of binary columns in the data matrix, an additional group constraint is added to the optimization problem in order to sparsify the whole set of columns representing one qualitative variable. This method—called group-sparse GSVD (gsGSVD)—integrates these constraints in a new algorithm via an iterative projection scheme onto the intersection of subspaces where each subspace implements a specific constraint. This algorithm is described in details, and we show how it can be adapted to the sparsification of simple and multiple correspondence analysis (as well as their barycentric discriminant analysis versions). This algorithm is illustrated with the analysis of four different data sets—each illustrating the sparsification of a particular CA-based method.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108179"},"PeriodicalIF":1.5,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selecting time-series hyperparameters with the artificial jackknife","authors":"Filippo Pellegrino","doi":"10.1016/j.csda.2025.108173","DOIUrl":"10.1016/j.csda.2025.108173","url":null,"abstract":"<div><div>A generalisation of the delete-<em>d</em> jackknife is proposed for solving hyperparameter selection problems in time series. The method is referred to as the artificial delete-<em>d</em> jackknife, emphasizing that it replaces the classic removal step with a fictitious deletion, wherein observed data points are replaced with artificial missing values. This procedure preserves the data order, ensuring seamless compatibility with time series. The approach is asymptotically justified and its finite-sample properties are studied via simulations. In addition, an application based on foreign exchange rates illustrates its practical relevance.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108173"},"PeriodicalIF":1.5,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hidden semi-Markov models with inhomogeneous state dwell-time distributions","authors":"Jan-Ole Koslik","doi":"10.1016/j.csda.2025.108171","DOIUrl":"10.1016/j.csda.2025.108171","url":null,"abstract":"<div><div>The well-established methodology for the estimation of hidden semi-Markov models (HSMMs) as hidden Markov models (HMMs) with extended state spaces is further developed. Covariate influences are incorporated across all aspects of the state process model, in particular regarding the distributions governing the state dwell time. The special case of periodically varying covariate effects on the state dwell-time distributions — and possibly the conditional transition probabilities — is examined in detail. Important properties of these models are derived, including the periodically varying unconditional state distribution as well as the overall state dwell-time distribution. Simulation studies are conducted to assess key properties of these models and provide recommendations for hyperparameter settings. A case study involving an HSMM with periodically varying dwell-time distributions is presented to analyse the movement trajectory of an Arctic muskox, demonstrating the practical relevance of the developed methodology.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108171"},"PeriodicalIF":1.5,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-based edge clustering for weighted networks with a noise component","authors":"Haomin Li, Daniel K. Sewell","doi":"10.1016/j.csda.2025.108172","DOIUrl":"10.1016/j.csda.2025.108172","url":null,"abstract":"<div><div>Clustering is a fundamental task in network analysis, essential for uncovering hidden structures within complex systems. Edge clustering, which focuses on relationships between nodes rather than the nodes themselves, has gained increased attention in recent years. However, existing edge clustering algorithms often overlook the significance of edge weights, which can represent the strength or capacity of connections, and fail to account for noisy edges—connections that obscure the true structure of the network. To address these challenges, the Weighted Edge Clustering Adjusting for Noise (WECAN) model is introduced. This novel algorithm integrates edge weights into the clustering process and includes a noise component that filters out spurious edges. WECAN offers a data-driven approach to distinguishing between meaningful and noisy edges, avoiding the arbitrary thresholding commonly used in network analysis. Its effectiveness is demonstrated through simulation studies and applications to real-world datasets, showing significant improvements over traditional clustering methods. Additionally, the R package “WECAN”<span><span><sup>1</sup></span></span> has been developed to facilitate its practical implementation.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108172"},"PeriodicalIF":1.5,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Functional nonlinear principal component analysis","authors":"Qingzhi Zhong , Xinyuan Song","doi":"10.1016/j.csda.2025.108169","DOIUrl":"10.1016/j.csda.2025.108169","url":null,"abstract":"<div><div>The widely adopted dimension reduction technique, functional principal component analysis (FPCA), typically represents functional data as a linear combination of functional principal components (FPCs) and their corresponding scores. However, this linear formulation is too restrictive to reflect reality because it fails to capture the nonlinear dependence of functional data when nonlinear features are present in the data. This study develops a novel FPCA model to uncover the nonlinear structures of functional data. The proposed method can accommodate multivariate functional data observed on different domains, and multidimensional functional data with gaps and holes. To navigate the complexities of spatial structure in multidimensional functional variables, tensor product smoothing and spline smoothing over triangulation are employed, providing precise tools for approximating nonparametric function. Furthermore, an efficient estimation approach and theory are developed when the number of FPCs diverges to infinity. To assess its performance comprehensively, extensive simulations are conducted, and the proposed method is applied to real data from the Alzheimer's Disease Neuroimaging Initiative study, affirming its practical efficacy in uncovering and interpreting nonlinear structures inherent in functional data.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108169"},"PeriodicalIF":1.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Manifold-valued models for analysis of EEG time series data","authors":"Tao Ding , Tom M.W. Nye , Yujiang Wang","doi":"10.1016/j.csda.2025.108168","DOIUrl":"10.1016/j.csda.2025.108168","url":null,"abstract":"<div><div>EEG (electroencephalogram) records brain electrical activity and is a vital clinical tool in the diagnosis and treatment of epilepsy. Time series of covariance matrices between EEG channels for patients suffering from epilepsy, obtained from an open-source dataset, are analysed. The aim is two-fold: to develop a model with interpretable parameters for different possible modes of EEG dynamics, and to explore the extent to which modelling results are affected by the choice of geometry imposed on the space of covariance matrices. The space of full-rank covariance matrices of fixed dimension forms a smooth manifold, and any statistical analysis inherently depends on the choice of metric or Riemannian structure on this manifold. The model specifies a distribution for the tangent direction vector at any time point, combining an autoregressive term, a mean reverting term and a form of Gaussian noise. Parameter inference is performed by maximum likelihood estimation, and we compare modelling results obtained using the standard Euclidean geometry and the affine invariant geometry on covariance matrices. The findings reveal distinct dynamics between epileptic seizures and interictal periods (between seizures), with interictal series characterized by strong mean reversion and absence of autoregression, while seizures exhibit significant autoregressive components with weaker mean reversion. The fitted models are also used to measure seizure dissimilarity within and between patients.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108168"},"PeriodicalIF":1.5,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Regression analysis of elliptically symmetric directional data","authors":"Zehao Yu, Xianzheng Huang","doi":"10.1016/j.csda.2025.108167","DOIUrl":"10.1016/j.csda.2025.108167","url":null,"abstract":"<div><div>A comprehensive toolkit is developed for regression analysis of directional data based on a flexible class of angular Gaussian distributions. Informative testing procedures to assess rotational symmetry around the mean direction, and the dependence of model parameters on covariates are proposed. Bootstrap-based algorithms are provided to assess the significance of the proposed test statistics. Moreover, a prediction region that achieves the smallest volume in a class of ellipsoidal prediction regions of the same coverage probability is constructed. The efficacy of these inference procedures is demonstrated in simulation experiments. Finally, this new toolkit is used to analyze directional data originating from a hydrology study and a bioinformatics application.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"208 ","pages":"Article 108167"},"PeriodicalIF":1.5,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}