{"title":"Discretization: Privacy-preserving data publishing for causal discovery","authors":"Youngmin Ahn , Woongjoon Park , Gunwoong Park","doi":"10.1016/j.csda.2025.108174","DOIUrl":"10.1016/j.csda.2025.108174","url":null,"abstract":"<div><div>As the importance of data privacy continues to grow, data masking has emerged as a crucial method. Notably, data masking techniques aim to protect individual privacy, while enabling data analysts to derive meaningful statistical results, such as the identification of directional or causal relationships between variables. Hence, this study demonstrates the advantages of a quantile-based discretization for protecting privacy and uncovering the relationships between variables in Gaussian directed acyclic graphical (DAG) models. Specifically, it introduces quantile-discretized Gaussian DAG models where each node variable is discretized based on the quantiles. Additionally, it proposes the bi-partition process, which aids in recovering the covariance matrix; hence, the models can be identifiable. Furthermore, a consistent algorithm is developed for learning the underlying structure using the quantile-based discretized data. 
Finally, through numerical experiments and the application of DAG learning algorithms to discretized MLB data, the proposed algorithm is demonstrated to significantly outperform the state-of-the-art DAG model learning algorithms.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108174"},"PeriodicalIF":1.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient regularized estimation of graphical proportional hazards model with interval-censored data","authors":"Huimin Lu , Yilong Wang , Heming Bing , Shuying Wang , Niya Li","doi":"10.1016/j.csda.2025.108178","DOIUrl":"10.1016/j.csda.2025.108178","url":null,"abstract":"<div><div>Variable selection is discussed in many cases in survival analysis. In particular, the analysis of using proportional hazards (PH) models to deal with censored survival data has established a large amount of literature. Based on interval-censored data, this paper discusses the situation of complex network structures existing in covariates. To address the issue, a more flexible and versatile PH model has been developed by combining probabilistic graphical models with PH models, to describe the correlation between covariates. Based on the block coordinate descent method, a penalized estimation method is proposed, which can simultaneously perform variable selection and parameter estimation. The effectiveness of the proposed model and its parameter estimation method are evaluated through simulation studies and the analysis of clinical trial data related to Alzheimer's disease, confirming the reliability and accuracy of the proposed model and method.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108178"},"PeriodicalIF":1.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear covariance selection model via ℓ1-penalization","authors":"Kwan-Young Bak , Seongoh Park","doi":"10.1016/j.csda.2025.108176","DOIUrl":"10.1016/j.csda.2025.108176","url":null,"abstract":"<div><div>This paper presents a study on an <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-penalized covariance regression method. Conventional approaches in high-dimensional covariance estimation often lack the flexibility to integrate external information. As a remedy, we adopt the regression-based covariance modeling framework and introduce a linear covariance selection model (LCSM) to encompass a broader spectrum of covariance structures when covariate information is available. Unlike existing methods, we do not assume that the true covariance matrix can be exactly represented by a linear combination of known basis matrices. Instead, we adopt additional basis matrices for a portion of the covariance patterns not captured by the given bases. To estimate high-dimensional regression coefficients, we exploit the sparsity-inducing <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-penalization scheme. Our theoretical analyses are based on the (symmetric) matrix regression model with additive random error matrix, which allows us to establish new non-asymptotic convergence rates of the proposed covariance estimator. The proposed method is implemented with the coordinate descent algorithm. We conduct empirical evaluation on simulated data to complement theoretical findings and underscore the efficacy of our approach. 
To show a practical applicability of our method, we further apply it to the co-expression analysis of liver gene expression data where the given basis corresponds to the adjacency matrix of the co-expression network.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108176"},"PeriodicalIF":1.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deflation-adjusted Bayesian information criterion for selecting the number of clusters in K-means clustering","authors":"Masao Ueki","doi":"10.1016/j.csda.2025.108170","DOIUrl":"10.1016/j.csda.2025.108170","url":null,"abstract":"<div><div>A deflation-adjusted Bayesian information criterion is proposed by introducing a closed-form adjustment to the variance estimate after K-means clustering. An expected lower bound of the deflation in the variance estimate after K-means clustering is derived and used as an adjustment factor for the variance estimates. The deflation-adjusted variance estimates are applied to the Bayesian information criterion under the Gaussian model for selecting the number of clusters. The closed-form expression makes the proposed deflation-adjusted Bayesian information criterion computationally efficient. Simulation studies show that the deflation-adjusted Bayesian information criterion performs better than other existing clustering methods in some situations, including K-means clustering with the number of clusters selected by standard Bayesian information criteria, the gap statistic, the average silhouette score, the prediction strength, and clustering using a Gaussian mixture model with the Bayesian information criterion. 
The proposed method is illustrated through a real data application for clustering human genomic data from the 1000 Genomes Project.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108170"},"PeriodicalIF":1.5,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multiple imputation approach for flexible modelling of interval-censored data with missing and censored covariates","authors":"Yichen Lou , Yuqing Ma , Liming Xiang , Jianguo Sun","doi":"10.1016/j.csda.2025.108177","DOIUrl":"10.1016/j.csda.2025.108177","url":null,"abstract":"<div><div>This paper discusses regression analysis of interval-censored failure time data that commonly occur in biomedical studies among others. For the situation, the failure event of interest is known only to occur within an interval instead of being observed exactly. In addition to interval censoring on the failure time of interest, sometimes covariates may be missing or suffer censoring, which can bring extra theoretical and computational challenges for the regression analysis. To deal with such data, we propose a novel multiple imputation approach with the use of the rejection sampling under a class of semiparametric transformation models. The proposed method is flexible and can lead to more efficient estimation than the existing methods, and the resulting estimators are shown to be consistent and asymptotically normal. An extensive simulation study is conducted and demonstrates that the proposed approach works well in practice. Finally, we apply the proposed approach to a set of real data on Alzheimer's disease that motivated this study.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108177"},"PeriodicalIF":1.5,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143714600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selecting time-series hyperparameters with the artificial jackknife","authors":"Filippo Pellegrino","doi":"10.1016/j.csda.2025.108173","DOIUrl":"10.1016/j.csda.2025.108173","url":null,"abstract":"<div><div>A generalisation of the delete-<em>d</em> jackknife is proposed for solving hyperparameter selection problems in time series. The method is referred to as the artificial delete-<em>d</em> jackknife, emphasizing that it replaces the classic removal step with a fictitious deletion, wherein observed data points are replaced with artificial missing values. This procedure preserves the data order, ensuring seamless compatibility with time series. The approach is asymptotically justified and its finite-sample properties are studied via simulations. In addition, an application based on foreign exchange rates illustrates its practical relevance.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108173"},"PeriodicalIF":1.5,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hidden semi-Markov models with inhomogeneous state dwell-time distributions","authors":"Jan-Ole Koslik","doi":"10.1016/j.csda.2025.108171","DOIUrl":"10.1016/j.csda.2025.108171","url":null,"abstract":"<div><div>The well-established methodology for the estimation of hidden semi-Markov models (HSMMs) as hidden Markov models (HMMs) with extended state spaces is further developed. Covariate influences are incorporated across all aspects of the state process model, in particular regarding the distributions governing the state dwell time. The special case of periodically varying covariate effects on the state dwell-time distributions — and possibly the conditional transition probabilities — is examined in detail. Important properties of these models are derived, including the periodically varying unconditional state distribution as well as the overall state dwell-time distribution. Simulation studies are conducted to assess key properties of these models and provide recommendations for hyperparameter settings. A case study involving an HSMM with periodically varying dwell-time distributions is presented to analyse the movement trajectory of an Arctic muskox, demonstrating the practical relevance of the developed methodology.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108171"},"PeriodicalIF":1.5,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-based edge clustering for weighted networks with a noise component","authors":"Haomin Li, Daniel K. Sewell","doi":"10.1016/j.csda.2025.108172","DOIUrl":"10.1016/j.csda.2025.108172","url":null,"abstract":"<div><div>Clustering is a fundamental task in network analysis, essential for uncovering hidden structures within complex systems. Edge clustering, which focuses on relationships between nodes rather than the nodes themselves, has gained increased attention in recent years. However, existing edge clustering algorithms often overlook the significance of edge weights, which can represent the strength or capacity of connections, and fail to account for noisy edges—connections that obscure the true structure of the network. To address these challenges, the Weighted Edge Clustering Adjusting for Noise (WECAN) model is introduced. This novel algorithm integrates edge weights into the clustering process and includes a noise component that filters out spurious edges. WECAN offers a data-driven approach to distinguishing between meaningful and noisy edges, avoiding the arbitrary thresholding commonly used in network analysis. Its effectiveness is demonstrated through simulation studies and applications to real-world datasets, showing significant improvements over traditional clustering methods. 
Additionally, the R package “WECAN” has been developed to facilitate its practical implementation.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108172"},"PeriodicalIF":1.5,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Functional nonlinear principal component analysis","authors":"Qingzhi Zhong , Xinyuan Song","doi":"10.1016/j.csda.2025.108169","DOIUrl":"10.1016/j.csda.2025.108169","url":null,"abstract":"<div><div>The widely adopted dimension reduction technique, functional principal component analysis (FPCA), typically represents functional data as a linear combination of functional principal components (FPCs) and their corresponding scores. However, this linear formulation is too restrictive to reflect reality because it fails to capture the nonlinear dependence of functional data when nonlinear features are present in the data. This study develops a novel FPCA model to uncover the nonlinear structures of functional data. The proposed method can accommodate multivariate functional data observed on different domains, and multidimensional functional data with gaps and holes. To navigate the complexities of spatial structure in multidimensional functional variables, tensor product smoothing and spline smoothing over triangulation are employed, providing precise tools for approximating nonparametric function. Furthermore, an efficient estimation approach and theory are developed when the number of FPCs diverges to infinity. 
To assess its performance comprehensively, extensive simulations are conducted, and the proposed method is applied to real data from the Alzheimer's Disease Neuroimaging Initiative study, affirming its practical efficacy in uncovering and interpreting nonlinear structures inherent in functional data.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108169"},"PeriodicalIF":1.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Manifold-valued models for analysis of EEG time series data","authors":"Tao Ding , Tom M.W. Nye , Yujiang Wang","doi":"10.1016/j.csda.2025.108168","DOIUrl":"10.1016/j.csda.2025.108168","url":null,"abstract":"<div><div>EEG (electroencephalogram) records brain electrical activity and is a vital clinical tool in the diagnosis and treatment of epilepsy. Time series of covariance matrices between EEG channels for patients suffering from epilepsy, obtained from an open-source dataset, are analysed. The aim is two-fold: to develop a model with interpretable parameters for different possible modes of EEG dynamics, and to explore the extent to which modelling results are affected by the choice of geometry imposed on the space of covariance matrices. The space of full-rank covariance matrices of fixed dimension forms a smooth manifold, and any statistical analysis inherently depends on the choice of metric or Riemannian structure on this manifold. The model specifies a distribution for the tangent direction vector at any time point, combining an autoregressive term, a mean reverting term and a form of Gaussian noise. Parameter inference is performed by maximum likelihood estimation, and we compare modelling results obtained using the standard Euclidean geometry and the affine invariant geometry on covariance matrices. The findings reveal distinct dynamics between epileptic seizures and interictal periods (between seizures), with interictal series characterized by strong mean reversion and absence of autoregression, while seizures exhibit significant autoregressive components with weaker mean reversion. 
The fitted models are also used to measure seizure dissimilarity within and between patients.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108168"},"PeriodicalIF":1.5,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}