{"title":"Denoising over networks with applications to partially observed epidemics","authors":"Claire Donnat , Olga Klopp , Nicolas Verzelen","doi":"10.1016/j.csda.2025.108276","DOIUrl":"10.1016/j.csda.2025.108276","url":null,"abstract":"<div><div>A novel method is introduced for denoising partially observed signals over networks using graph total variation (TV) regularization, a technique adapted from signal processing to handle binary data. This approach extends existing results derived for Gaussian data to the discrete, binary case — a method hereafter referred to as “one-bit TV denoising.” The framework considers a network represented as a set of nodes with binary observations, where edges encode pairwise relationships between nodes. A key theoretical contribution is the establishment of consistency guarantees of graph TV denoising for the recovery of underlying node-level probabilities. The method is well suited for settings with missing data, enabling robust inference from incomplete observations. Extensive numerical experiments and real-world applications further highlight its effectiveness, underscoring its potential in various practical scenarios that require denoising and prediction on networks with binary-valued data. 
Finally, applications to two real-world epidemic scenarios demonstrate that one-bit total variation denoising significantly enhances the accuracy of network-based nowcasting and forecasting.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108276"},"PeriodicalIF":1.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145322732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
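The core idea of the abstract above, recovering node-level probabilities from binary observations via a graph total variation penalty, can be illustrated with a minimal sketch on a chain graph. This is not the authors' algorithm; the penalty weight `lam`, step size, subgradient solver, and simulated data are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Chain graph with 60 nodes and piecewise-constant true infection probabilities.
n = 60
p_true = np.where(np.arange(n) < n // 2, 0.8, 0.2)
y = rng.binomial(1, p_true).astype(float)     # one-bit (binary) observations
observed = rng.random(n) < 0.7                # roughly 30% of nodes are missing

# Minimise  sum_{observed i} logistic_loss(theta_i; y_i) + lam * TV(theta)
# over node logits theta, by plain subgradient descent.
lam, step = 0.3, 0.05
theta = np.zeros(n)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-theta))
    grad = np.where(observed, p - y, 0.0)     # missing nodes contribute no loss
    d = np.sign(np.diff(theta))               # subgradient of sum_i |theta_{i+1} - theta_i|
    grad[:-1] -= lam * d
    grad[1:] += lam * d
    theta -= step * grad

p_hat = 1.0 / (1.0 + np.exp(-theta))
mse = float(np.mean((p_hat - p_true) ** 2))
# Naive baseline: raw bits where observed, 0.5 where missing.
mse_raw = float(np.mean((np.where(observed, y, 0.5) - p_true) ** 2))
```

On this toy example the TV-regularised estimate pools information across neighbouring nodes and handles unobserved nodes naturally, which is the qualitative behaviour the paper's consistency guarantees concern.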
{"title":"Fast autoregressive model for multivariate dependent outcomes with application to lipidomics analysis for Alzheimer’s disease and APOE-ε4","authors":"Hwiyoung Lee , Zhenyao Ye , Chixiang Chen , Peter Kochunov , L. Elliot Hong , Shuo Chen","doi":"10.1016/j.csda.2025.108280","DOIUrl":"10.1016/j.csda.2025.108280","url":null,"abstract":"<div><div>Association analysis of multivariate omics outcomes is challenging due to the high dimensionality and inter-correlation among outcome variables. In practice, the classic multi-univariate analysis approaches are commonly employed, utilizing linear regression models for each individual outcome followed by adjustments for multiplicity through control of the false discovery rate (FDR) or family-wise error rate (FWER). While straightforward, these multi-univariate methods overlook dependencies between outcome variables. This oversight leads to less accurate statistical inferences, characterized by lower power and an increased false discovery rate, ultimately resulting in reduced replicability across studies. Recently, advanced frequentist and Bayesian methods have been developed to account for these dependencies. However, these methods often pose significant computational challenges for researchers in the field. To bridge this gap, a computationally efficient autoregressive multivariate regression model is proposed that explicitly accounts for the dependence structure among outcome variables. Through extensive simulations, it is demonstrated that the approach provides more accurate multivariate inferences than traditional methods and remains robust even under model misspecification. 
Additionally, the proposed method is applied to investigate whether the associations between serum lipidomics outcomes and Alzheimer’s disease differ between <span><math><mrow><mrow><mi>ε</mi></mrow><mn>4</mn></mrow></math></span> allele carriers and non-carriers of the apolipoprotein E (APOE) gene.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108280"},"PeriodicalIF":1.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145270815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
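For contrast with the proposed autoregressive model, the "classic multi-univariate" baseline the abstract criticises (one linear regression per outcome, followed by Benjamini-Hochberg FDR adjustment) can be sketched as follows. All data, dimensions, and effect sizes are simulated for illustration.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
n, q = 200, 50                     # subjects, outcome variables
x = rng.normal(size=n)             # e.g. a genotype/exposure score
beta = np.zeros(q)
beta[:10] = 0.5                    # first 10 outcomes truly associated
y = np.outer(x, beta) + rng.normal(size=(n, q))

# One simple linear regression per outcome.
xc = x - x.mean()
sxx = xc @ xc
slopes = xc @ (y - y.mean(axis=0)) / sxx
resid = (y - y.mean(axis=0)) - np.outer(xc, slopes)
se = np.sqrt((resid ** 2).sum(axis=0) / (n - 2) / sxx)
z = slopes / se
pvals = np.array([erfc(abs(v) / sqrt(2)) for v in z])  # two-sided normal p-values

# Benjamini-Hochberg step-up procedure at FDR level 0.05.
order = np.argsort(pvals)
passed = pvals[order] <= 0.05 * np.arange(1, q + 1) / q
k = int(np.nonzero(passed)[0].max()) + 1 if passed.any() else 0
rejected = np.zeros(q, dtype=bool)
rejected[order[:k]] = True
```

This baseline treats the outcomes as unrelated; the paper's point is that modelling their dependence yields more accurate inference than such per-outcome testing.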
{"title":"Measuring multivariate regression association via spatial sign","authors":"Jia-Han Shih , Yi-Hau Chen","doi":"10.1016/j.csda.2025.108288","DOIUrl":"10.1016/j.csda.2025.108288","url":null,"abstract":"<div><div>A regression association measure is proposed for capturing predictability of a multivariate outcome <span><math><mrow><mi>Y</mi><mo>=</mo><mo>(</mo><msub><mi>Y</mi><mn>1</mn></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mi>Y</mi><mi>d</mi></msub><mo>)</mo></mrow></math></span> from a multivariate covariate <span><math><mrow><mi>X</mi><mo>=</mo><mo>(</mo><msub><mi>X</mi><mn>1</mn></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mi>X</mi><mi>p</mi></msub><mo>)</mo></mrow></math></span>. Motivated by existing measures, the conventional Kendall’s tau is first generalized to measure multivariate association between two random vectors. Then the predictability of <span><math><mi>Y</mi></math></span> from <span><math><mi>X</mi></math></span> is measured by the generalized multivariate Kendall’s tau between <span><math><mi>Y</mi></math></span> and <span><math><msup><mi>Y</mi><mo>′</mo></msup></math></span>, where <span><math><mi>Y</mi></math></span> and <span><math><msup><mi>Y</mi><mo>′</mo></msup></math></span> share the same conditional distribution and are conditionally independent given <span><math><mi>X</mi></math></span>. The proposed regression association measure can be expressed as the proportion of the variance of a function of <span><math><mi>Y</mi></math></span> that can be explained by <span><math><mi>X</mi></math></span>, indicating that the measure has a direct interpretation in terms of predictability. Based on the proposed measure, a conditional regression association measure is further proposed, which can be utilized to perform variable selection. 
Since the proposed measures are based on <span><math><mi>Y</mi></math></span> and <span><math><msup><mi>Y</mi><mo>′</mo></msup></math></span>, a simple nonparametric estimation method based on nearest neighbors is available. An R package, <span>MRAM</span>, has been developed for implementation. Simulation studies are carried out to assess the performance of the proposed methods and real data examples are analyzed for illustration.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108288"},"PeriodicalIF":1.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145322731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
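The spatial-sign ingredient can be illustrated with a plug-in estimate of a spatial-sign-based Kendall's tau between two random vectors, averaging inner products of spatial signs over sample pairs. This is a simplified stand-in for the paper's measure, not the MRAM implementation; data and dimensions are illustrative.

```python
import numpy as np

def spatial_sign(v):
    """Map each row of v to the unit sphere (zero rows stay zero)."""
    norm = np.linalg.norm(v, axis=1, keepdims=True)
    norm[norm == 0] = 1.0
    return v / norm

def sign_kendall_tau(a, b):
    """Spatial-sign analogue of Kendall's tau between samples a (n,d), b (n,p):
    average inner product of spatial signs of pairwise differences."""
    i, j = np.triu_indices(a.shape[0], k=1)
    sa = spatial_sign(a[i] - a[j])
    sb = spatial_sign(b[i] - b[j])
    return float(np.mean(np.sum(sa * sb, axis=1)))

rng = np.random.default_rng(2)
x = rng.normal(size=(300, 2))
y_dep = x + 0.3 * rng.normal(size=(300, 2))   # strongly predictable from x
y_ind = rng.normal(size=(300, 2))             # independent of x

tau_dep = sign_kendall_tau(y_dep, x)
tau_ind = sign_kendall_tau(y_ind, x)
```

A dependent pair yields a value near 1 while independence yields a value near 0, the qualitative behaviour a regression association measure needs.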
{"title":"Modelling catastrophic extinction in stochastic birth-death process: Analytical insights, estimation, and efficient simulation","authors":"Clement Twumasi","doi":"10.1016/j.csda.2025.108302","DOIUrl":"10.1016/j.csda.2025.108302","url":null,"abstract":"<div><div>A comprehensive analytical and computational framework is developed for the linear birth-death process (LBDP) with catastrophic extinction (BDC process), a continuous-time Markov model that incorporates sudden extinction events into the classical LBDP. Despite its conceptual simplicity, the underlying BDC process poses substantial challenges in deriving exact transition probabilities and performing reliable parameter estimation, particularly under discrete-time observations. While previous work established foundational properties using spectral methods and probability generating functions (PGFs), explicit analytical expressions for transition probabilities and theoretical moments have remained unavailable, limiting practical applications in extinction-prone systems. This limitation is addressed by reparameterising the PGF through functional restructuring, yielding exact closed-form expressions for the transition probability function and the theoretical moments of the discretely observed BDC process, with results validated through comprehensive numerical experiments for the first time. Three parameter estimation approaches tailored to the BDC process are introduced and evaluated: maximum likelihood estimation (MLE), generalised method of moments (GMM), and an embedded Galton-Watson (GW) approach, with trade-offs between computational efficiency and estimation accuracy examined across diverse simulation scenarios. To improve scalability, a Monte Carlo simulation framework based on a hybrid tau-leaping algorithm is formulated, specifically adapted to extinction-driven dynamics, offering a computationally efficient alternative to the exact stochastic simulation algorithm (SSA). 
The proposed methodologies offer a tractable and scalable foundation for incorporating the BDC process into applied stochastic models, particularly in ecological, epidemiological, and biological systems where populations are susceptible to sudden collapse due to catastrophic events such as host mortality or immune response.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"215 ","pages":"Article 108302"},"PeriodicalIF":1.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
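The exact stochastic simulation algorithm (SSA) that the hybrid tau-leaping scheme is benchmarked against can be sketched for a BDC process as follows. The rates, horizon, and the assumption of a constant catastrophe intensity `kappa` are illustrative choices, not the paper's settings.

```python
import numpy as np

def simulate_bdc(n0, birth, death, kappa, t_max, rng):
    """Exact SSA sample path of a linear birth-death process with an extra
    catastrophe event (rate kappa) that wipes out the whole population.
    Returns the population size at time t_max (0 if extinct earlier)."""
    t, n = 0.0, n0
    while n > 0:
        total = n * (birth + death) + kappa
        t += rng.exponential(1.0 / total)
        if t >= t_max:
            break                      # no further event before the horizon
        u = rng.random() * total
        if u < kappa:
            n = 0                      # catastrophic extinction
        elif u < kappa + n * birth:
            n += 1                     # birth
        else:
            n -= 1                     # death
    return n

rng = np.random.default_rng(3)
finals = [simulate_bdc(10, 0.2, 0.1, 0.05, 5.0, rng) for _ in range(2000)]
extinct_frac = float(np.mean(np.array(finals) == 0))
```

Because every event is simulated individually, SSA cost grows with the event count; this is the bottleneck the paper's hybrid tau-leaping alternative is designed to avoid.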
{"title":"Overview of normal-reference tests for high-dimensional means with implementation in the R package ‘HDNRA’","authors":"Pengfei Wang , Tianming Zhu , Jin-Ting Zhang","doi":"10.1016/j.csda.2025.108269","DOIUrl":"10.1016/j.csda.2025.108269","url":null,"abstract":"<div><div>The challenge of testing for equal mean vectors in high-dimensional data poses significant difficulties in statistical inference. Much of the existing literature introduces methods that often rely on stringent regularity conditions for the underlying covariance matrices, enabling asymptotic normality of test statistics. However, this can lead to complications in controlling test size. To address these issues, a new set of tests has emerged, leveraging the normal-reference approach to improve reliability. The latest normal-reference methods for testing equality of mean vectors in high-dimensional samples, potentially with differing covariance structures, are reviewed. The theoretical underpinnings of these tests are revisited, providing a new unified justification for the validity of centralized <span><math><msup><mi>L</mi><mn>2</mn></msup></math></span>-norm-based normal-reference tests (NRTs) by deriving the convergence rate of the distance between the null distribution of the test statistic and its corresponding normal-reference distribution. To facilitate practical application, an <span>R</span> package, <span>HDNRA</span>, is introduced, implementing these NRTs and extending beyond the two-sample problem to accommodate general linear hypothesis testing (GLHT). The package, designed with user-friendliness in mind, achieves efficient computation through a core implemented in <span>C++</span> using <span>Rcpp</span>, <span>OpenMP</span>, and <span>RcppArmadillo</span>. 
Examples with real datasets are included, showcasing the application of various tests and providing insights into their practical utility.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108269"},"PeriodicalIF":1.6,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145010731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
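The HDNRA package itself is in R; as a language-neutral illustration, the centred L2-norm statistic underlying this family of tests (squared norm of the mean difference minus its null expectation, in the style of Chen and Qin) can be sketched as below. This is a sketch of the statistic only, not the package's implementation or its normal-reference calibration.

```python
import numpy as np

def l2_norm_stat(x, y):
    """Centred L2-norm two-sample statistic: squared norm of the mean
    difference minus its expectation under equal means (trace terms)."""
    n1, n2 = x.shape[0], y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    c1 = np.var(x, axis=0, ddof=1).sum() / n1   # tr(S1) / n1
    c2 = np.var(y, axis=0, ddof=1).sum() / n2   # tr(S2) / n2
    return float(diff @ diff - c1 - c2)

rng = np.random.default_rng(4)
p = 500                                    # dimension far exceeds sample sizes
x = rng.normal(size=(40, p))
y_null = rng.normal(size=(40, p))          # equal mean vectors
y_alt = rng.normal(loc=0.3, size=(40, p))  # shifted mean vector

t_null = l2_norm_stat(x, y_null)
t_alt = l2_norm_stat(x, y_alt)
```

The statistic fluctuates around zero under equal means and grows with the squared mean separation; the normal-reference approach reviewed in the paper supplies the calibrated null distribution.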
{"title":"Fast and efficient causal inference in large-scale data via subsampling and projection calibration","authors":"Miaomiao Su","doi":"10.1016/j.csda.2025.108281","DOIUrl":"10.1016/j.csda.2025.108281","url":null,"abstract":"<div><div>Estimating the average treatment effect in large-scale datasets faces significant computational and storage challenges. Subsampling has emerged as a critical strategy to mitigate these issues. This paper proposes a novel subsampling method that builds on the G-estimation method, which offers the double robustness property. The proposed method uses a small subset of data to estimate computationally complex nuisance parameters, while leveraging the full dataset for the computationally simple final estimation. To ensure that the resulting estimator remains first-order insensitive to variations in nuisance parameters, a projection approach is introduced to optimize the estimation of the outcome regression function and treatment regression function such that the Neyman orthogonality conditions are satisfied. It is shown that the resulting estimator is asymptotically normal and achieves the same convergence rate as the full data-based estimator when either the treatment or the outcome model is correctly specified. Additionally, when both models are correctly specified, the proposed estimator achieves the same asymptotic variance as the full data-based estimator. The finite sample performance of the proposed method is demonstrated through simulation studies and an application to birth data, comprising over 30 million observations collected over the past eight years. 
Numerical results indicate that the proposed estimator is nearly as computationally efficient as the uniform subsampling estimator, while achieving similar estimation efficiency to the full data-based G-estimator.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108281"},"PeriodicalIF":1.6,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145158570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
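The division of labour the abstract describes, expensive nuisance fits on a small subsample followed by cheap final estimation on the full data, can be sketched with a generic doubly robust (AIPW) estimator standing in for the paper's calibrated G-estimator. The data-generating model, subsample size, and Newton solver are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200_000                                        # "large-scale" full dataset
x = rng.normal(size=N)
a = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))    # treatment assignment
y = 1.0 * a + 2.0 * x + rng.normal(size=N)         # true average treatment effect = 1.0

# Step 1: fit both nuisance models on a small uniform subsample.
idx = rng.choice(N, size=2_000, replace=False)
xs, ts, ys = x[idx], a[idx], y[idx]

# Propensity score via logistic regression (Newton's method, intercept + slope).
Z = np.column_stack([np.ones_like(xs), xs])
w = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(Z @ w)))
    hess = Z.T @ (Z * (p * (1 - p))[:, None])
    w -= np.linalg.solve(hess, Z.T @ (p - ts))

# Outcome regression via OLS of y on (1, a, x).
D = np.column_stack([np.ones_like(xs), ts, xs])
beta = np.linalg.lstsq(D, ys, rcond=None)[0]

# Step 2: computationally cheap doubly robust (AIPW) evaluation on the full data.
e = 1 / (1 + np.exp(-(w[0] + w[1] * x)))
mu1 = beta[0] + beta[1] + beta[2] * x
mu0 = beta[0] + beta[2] * x
ate = float(np.mean(a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e) + mu1 - mu0))
```

The Neyman orthogonality of the final estimating equation is what makes the estimate first-order insensitive to the subsampling error in the nuisance fits, the property the paper's projection calibration is built to guarantee.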
{"title":"Robust selection of the number of change-points via FDR control","authors":"Hui Chen , Chengde Qian , Qin Zhou","doi":"10.1016/j.csda.2025.108272","DOIUrl":"10.1016/j.csda.2025.108272","url":null,"abstract":"<div><div>Robust quantification of uncertainty regarding the number of change-points presents a significant challenge in data analysis, particularly when employing false discovery rate (FDR) control techniques. Emphasizing the detection of genuine signals while controlling false positives is crucial, especially for identifying shifts in location parameters within flexible distributions. Traditional parametric methods often exhibit sensitivity to outliers and heavy-tailed data. Addressing this limitation, a robust method accommodating diverse data structures is proposed. The approach constructs component-wise sign-based statistics. Leveraging the global symmetry inherent in these statistics enables the derivation of data-driven thresholds suitable for multiple testing scenarios. Method development occurs within the framework of U-statistics, which naturally encompasses existing cumulative sum-based procedures. Theoretical guarantees establish FDR control for the component-wise sign-based method under mild assumptions. Demonstrations of effectiveness utilize simulations with synthetic data and analyses of real data.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108272"},"PeriodicalIF":1.6,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145099733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
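One component of the component-wise sign-based statistics can be illustrated with a sign CUSUM for a single location shift. The threshold-free argmax localisation below is a deliberate simplification of the paper's FDR-controlled multiple-testing procedure; the Cauchy noise and shift size are illustrative.

```python
import numpy as np

def sign_cusum(x):
    """Sign-based CUSUM curve: cumulative sums of signs of deviations from
    the overall median, insensitive to outliers and heavy tails."""
    s = np.sign(x - np.median(x))
    return np.abs(np.cumsum(s)) / np.sqrt(len(x))

rng = np.random.default_rng(6)
n = 400
x = rng.standard_cauchy(n)           # heavy-tailed noise with no finite mean
x[n // 2:] += 3.0                    # location shift at observation 200

stat = sign_cusum(x)
tau_hat = int(np.argmax(stat)) + 1   # estimated change-point location
```

Because only signs enter the statistic, a mean-based CUSUM's sensitivity to Cauchy outliers disappears, which is the robustness property the abstract emphasises.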
{"title":"Gamma approximation of stratified truncated exact test (GASTE-test) & application","authors":"Alexandre Wendling, Clovis Galiez","doi":"10.1016/j.csda.2025.108277","DOIUrl":"10.1016/j.csda.2025.108277","url":null,"abstract":"<div><div>The analysis of binary outcomes and features, such as the effect of vaccination on health, often relies on 2 <span><math><mo>×</mo></math></span> 2 contingency tables. However, confounding factors such as age or gender call for stratified analysis, creating sub-tables; this situation is common in bioscience, epidemiological, and social research, as well as in meta-analyses. Traditional methods for testing associations across strata, such as the Cochran-Mantel-Haenszel (CMH) test, struggle with small sample sizes and heterogeneity of effects between strata. Exact tests can address these issues, but are computationally expensive. To address these challenges, the Gamma Approximation of Stratified Truncated Exact (GASTE) test is proposed. It approximates the exact statistic of the combination of p-values with discrete support, leveraging the gamma distribution to approximate the distribution of the test statistic under stratification, providing fast and accurate p-value calculations even when effects vary between strata. The GASTE method maintains high statistical power and low type I error rates, offering more sensitive and reliable detection than traditional methods. It is computationally efficient and broadens the applicability of exact tests in research fields with stratified binary data. The method is demonstrated through two applications, an ecological study of Alpine plant associations and a 1973 case study on admissions at the University of California, Berkeley, in both cases offering substantial improvements over traditional approaches. GASTE is available as an open-source package at <span><span>https://github.com/AlexandreWen/gaste</span><svg><path></path></svg></span>. 
A Python package is available on PyPI at <span><span>https://pypi.org/project/gaste-test/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108277"},"PeriodicalIF":1.6,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145221243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
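The core approximation, moment-matching a gamma distribution to the exact discrete null of the combined statistic T = Σ<sub>k</sub> −2 log p<sub>k</sub>, can be sketched on toy discrete p-value supports. The supports, the untruncated Fisher combination, and the observed value `t_obs` are all illustrative, not the GASTE implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy discrete null supports of per-stratum p-values, as produced by exact
# tests on small 2x2 tables; P(p = s_j) = s_j - s_{j-1} for a valid p-value.
supports = [np.array([0.02, 0.10, 0.35, 1.0]),
            np.array([0.05, 0.25, 0.60, 1.0]),
            np.array([0.01, 0.15, 0.50, 1.0])]
masses = [np.diff(np.concatenate([[0.0], s])) for s in supports]

# Exact first two moments of T = sum_k -2 log p_k under the null.
logs = [-2 * np.log(s) for s in supports]
mu = sum(float(m @ l) for m, l in zip(masses, logs))
var = sum(float(m @ l**2) - float(m @ l) ** 2 for m, l in zip(masses, logs))

# Moment-matched gamma: shape a and scale b with a*b = mu and a*b^2 = var.
b = var / mu
a = mu / b

# Check quality: exact null tail of T by enumeration vs gamma Monte Carlo tail.
t_obs = 12.0
grid = np.array([l1 + l2 + l3
                 for l1 in logs[0] for l2 in logs[1] for l3 in logs[2]])
probs = np.array([m1 * m2 * m3
                  for m1 in masses[0] for m2 in masses[1] for m3 in masses[2]])
exact_tail = float(probs[grid >= t_obs].sum())
gamma_tail = float(np.mean(rng.gamma(a, b, size=200_000) >= t_obs))
```

Full enumeration of the discrete support grows exponentially in the number of strata; the gamma approximation replaces it with two moments, which is the computational point of the method.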
{"title":"Kernel density estimation with a Markov chain Monte Carlo sample","authors":"Hang J. Kim , Steven N. MacEachern , Young Min Kim , Yoonsuh Jung","doi":"10.1016/j.csda.2025.108271","DOIUrl":"10.1016/j.csda.2025.108271","url":null,"abstract":"<div><div>Bayesian inference relies on the posterior distribution, which is often estimated with a Markov chain Monte Carlo sampler. The sampler produces a dependent stream of variates from the limiting distribution of the Markov chain, the posterior distribution. When one wishes to display the estimated posterior density, a natural choice is the histogram. However, abundant literature has shown that the kernel density estimator is more accurate than the histogram in terms of mean integrated squared error for an i.i.d. sample. With this as motivation, a kernel density estimation method is proposed that is appropriate for the dependence in the Markov chain Monte Carlo output. To account for the dependence, the cross-validation criterion is modified to select the bandwidth in standard kernel density estimation approaches. A data-driven adjustment to the biased cross-validation method is suggested by introducing the integrated autocorrelation time of the kernel. The convergence of the modified bandwidth to the optimal bandwidth is shown by adapting theorems from the time series literature. Simulation studies show that the proposed method finds the bandwidth close to the optimal value, while standard methods lead to smaller bandwidths under Markov chain samples and hence to undersmoothed density estimates. A study with real data shows that the proposed method has a considerably smaller integrated mean squared error than standard methods. 
The R package <span>KDEmcmc</span> to implement the suggested algorithm is available on the Comprehensive R Archive Network.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108271"},"PeriodicalIF":1.6,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
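The key adjustment, inflating an i.i.d. bandwidth by a power of the integrated autocorrelation time (IACT), can be sketched on an AR(1) chain standing in for MCMC output. The Silverman reference rule, the 0.05 truncation of the ACF, and the 1/5 power are illustrative choices, not the package's modified cross-validation criterion.

```python
import numpy as np

rng = np.random.default_rng(8)
n, rho = 5000, 0.9
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):                        # AR(1) chain standing in for MCMC output
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

# Integrated autocorrelation time, truncating the empirical ACF at lag m.
xc = x - x.mean()
acf = np.correlate(xc, xc, mode="full")[n - 1:] / (xc @ xc)
m = int(np.argmax(acf < 0.05))               # first lag with negligible correlation
iact = float(1 + 2 * acf[1:m].sum())

h0 = 1.06 * x.std() * n ** (-1 / 5)          # Silverman's rule: pretends i.i.d. data
h = h0 * iact ** (1 / 5)                     # effective sample size roughly n / iact

# Gaussian KDE on a grid, compared with the true N(0,1) stationary density.
grid = np.linspace(-4, 4, 201)
dt = grid[1] - grid[0]

def kde(data, bw):
    z = (grid[:, None] - data[None, :]) / bw
    return np.exp(-0.5 * z**2).mean(axis=1) / (bw * np.sqrt(2 * np.pi))

true_pdf = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
ise_adj = float(np.sum((kde(x, h) - true_pdf) ** 2) * dt)
ise_raw = float(np.sum((kde(x, h0) - true_pdf) ** 2) * dt)
```

The inflated bandwidth counteracts the undersmoothing that i.i.d. rules produce on dependent chains, the failure mode the paper's simulations document for standard bandwidth selectors.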
{"title":"Measure selection for functional linear model","authors":"Su I Iao, Hans-Georg Müller","doi":"10.1016/j.csda.2025.108270","DOIUrl":"10.1016/j.csda.2025.108270","url":null,"abstract":"<div><div>Advancements in modern science have led to an increased prevalence of functional data, which are usually viewed as elements of the space of square-integrable functions <span><math><msup><mi>L</mi><mn>2</mn></msup></math></span>. Core methods in functional data analysis, such as functional principal component analysis, are typically grounded in the Hilbert structure of <span><math><msup><mi>L</mi><mn>2</mn></msup></math></span> and rely on inner products based on integrals with respect to the Lebesgue measure over a fixed domain. A more flexible framework is proposed, where the measure can be arbitrary, allowing natural extensions to unbounded domains and prompting the question of optimal measure choice. Specifically, a novel functional linear model is introduced that incorporates a data-adaptive choice of the measure that defines the space, alongside an enhanced functional principal component analysis. Selecting a good measure can improve the model’s predictive performance, especially when the underlying processes are not well represented under the default Lebesgue measure. 
Simulations, as well as applications to COVID-19 data and the National Health and Nutrition Examination Survey data, show that the proposed approach consistently outperforms the conventional functional linear model.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108270"},"PeriodicalIF":1.6,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145099731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
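The role of the measure can be illustrated by running FPCA under a weighted inner product, with the measure represented by grid weights w so that ⟨f, g⟩ = Σ f(t) g(t) w(t) dt. The tilted weight below is a hypothetical alternative to Lebesgue for illustration; it is not the paper's data-adaptive selection rule.

```python
import numpy as np

rng = np.random.default_rng(9)
grid = np.linspace(0.0, 1.0, 101)
dt = grid[1] - grid[0]
n = 150

# Functional sample: two orthonormal modes with score variances 4 and 1.
phi1 = np.sqrt(2) * np.sin(np.pi * grid)
phi2 = np.sqrt(2) * np.sin(2 * np.pi * grid)
scores = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])
X = scores @ np.vstack([phi1, phi2]) + 0.1 * rng.normal(size=(n, grid.size))

def fpca_eigvals(X, w):
    """Eigenvalues of the empirical covariance operator under the inner
    product <f, g> = sum_j f(t_j) g(t_j) w(t_j) dt (measure = w dt)."""
    Xc = X - X.mean(axis=0)
    G = (Xc * (w * dt)) @ Xc.T / X.shape[0]   # weighted Gram matrix, symmetric PSD
    return np.linalg.eigvalsh(G)[::-1]        # largest first

w_lebesgue = np.ones_like(grid)
w_tilted = np.exp(2 * grid)                   # hypothetical alternative measure
ev_leb = fpca_eigvals(X, w_lebesgue)
ev_tilt = fpca_eigvals(X, w_tilted)
```

Changing w changes the geometry, and hence the principal components and the downstream functional linear model; the paper's contribution is choosing w from the data rather than defaulting to w ≡ 1.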