{"title":"Robust selection of the number of change-points via FDR control","authors":"Hui Chen , Chengde Qian , Qin Zhou","doi":"10.1016/j.csda.2025.108272","DOIUrl":"10.1016/j.csda.2025.108272","url":null,"abstract":"<div><div>Robust quantification of uncertainty regarding the number of change-points presents a significant challenge in data analysis, particularly when employing false discovery rate (FDR) control techniques. Emphasizing the detection of genuine signals while controlling false positives is crucial, especially for identifying shifts in location parameters within flexible distributions. Traditional parametric methods often exhibit sensitivity to outliers and heavy-tailed data. Addressing this limitation, a robust method accommodating diverse data structures is proposed. The approach constructs component-wise sign-based statistics. Leveraging the global symmetry inherent in these statistics enables the derivation of data-driven thresholds suitable for multiple testing scenarios. Method development occurs within the framework of U-statistics, which naturally encompasses existing cumulative sum-based procedures. Theoretical guarantees establish FDR control for the component-wise sign-based method under mild assumptions. Demonstrations of effectiveness utilize simulations with synthetic data and analyses of real data.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108272"},"PeriodicalIF":1.6,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145099733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measure selection for functional linear model","authors":"Su I Iao, Hans-Georg Müller","doi":"10.1016/j.csda.2025.108270","DOIUrl":"10.1016/j.csda.2025.108270","url":null,"abstract":"<div><div>Advancements in modern science have led to an increased prevalence of functional data, which are usually viewed as elements of the space of square-integrable functions <span><math><msup><mi>L</mi><mn>2</mn></msup></math></span>. Core methods in functional data analysis, such as functional principal component analysis, are typically grounded in the Hilbert structure of <span><math><msup><mi>L</mi><mn>2</mn></msup></math></span> and rely on inner products based on integrals with respect to the Lebesgue measure over a fixed domain. A more flexible framework is proposed, where the measure can be arbitrary, allowing natural extensions to unbounded domains and prompting the question of optimal measure choice. Specifically, a novel functional linear model is introduced that incorporates a data-adaptive choice of the measure that defines the space, alongside an enhanced function principal component analysis. Selecting a good measure can improve the model’s predictive performance, especially when the underlying processes are not well-represented when adopting the default Lebesgue measure. Simulations, as well as applications to COVID-19 data and the National Health and Nutrition Examination Survey data, show that the proposed approach consistently outperforms the conventional functional linear model.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108270"},"PeriodicalIF":1.6,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145099731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kernel density estimation with a Markov chain Monte Carlo sample","authors":"Hang J. Kim , Steven N. MacEachern , Young Min Kim , Yoonsuh Jung","doi":"10.1016/j.csda.2025.108271","DOIUrl":"10.1016/j.csda.2025.108271","url":null,"abstract":"<div><div>Bayesian inference relies on the posterior distribution, which is often estimated with a Markov chain Monte Carlo sampler. The sampler produces a dependent stream of variates from the limiting distribution of the Markov chain, the posterior distribution. When one wishes to display the estimated posterior density, a natural choice is the histogram. However, abundant literature has shown that the kernel density estimator is more accurate than the histogram in terms of mean integrated squared error for an i.i.d. sample. With this as motivation, a kernel density estimation method is proposed that is appropriate for the dependence in the Markov chain Monte Carlo output. To account for the dependence, the cross-validation criterion is modified to select the bandwidth in standard kernel density estimation approaches. A data-driven adjustment to the biased cross-validation method is suggested with introducing the integrated autocorrelation time of the kernel. The convergence of the modified bandwidth to the optimal bandwidth is shown by adapting theorems from the time series literature. Simulation studies show that the proposed method finds the bandwidth close to the optimal value, while standard methods lead to smaller bandwidths under Markov chain samples and hence to undersmoothed density estimates. A study with real data shows that the proposed method has a considerably smaller integrated mean squared error than standard methods. The R package <span>KDEmcmc</span> to implement the suggested algorithm is available on the Comprehensive R Archive Network.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108271"},"PeriodicalIF":1.6,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overview of normal-reference tests for high-dimensional means with implementation in the R package ‘HDNRA’","authors":"Pengfei Wang , Tianming Zhu , Jin-Ting Zhang","doi":"10.1016/j.csda.2025.108269","DOIUrl":"10.1016/j.csda.2025.108269","url":null,"abstract":"<div><div>The challenge of testing for equal mean vectors in high-dimensional data poses significant difficulties in statistical inference. Much of the existing literature introduces methods that often rely on stringent regularity conditions for the underlying covariance matrices, enabling asymptotic normality of test statistics. However, this can lead to complications in controlling test size. To address these issues, a new set of tests has emerged, leveraging the normal-reference approach to improve reliability. The latest normal-reference methods for testing equality of mean vectors in high-dimensional samples, potentially with differing covariance structures, are reviewed. The theoretical underpinnings of these tests are revisited, providing a new unified justification for the validity of centralized <span><math><msup><mi>L</mi><mn>2</mn></msup></math></span>-norm-based normal-reference tests (NRTs) by deriving the convergence rate of the distance between the null distribution of the test statistic and its corresponding normal-reference distribution. To facilitate practical application, an <span>R</span> package, <span>HDNRA</span>, is introduced, implementing these NRTs and extending beyond the two-sample problem to accommodate general linear hypothesis testing (GLHT). The package, designed with user-friendliness in mind, achieves efficient computation through a core implemented in <span>C++</span> using <span>Rcpp</span>, <span>OpenMP</span>, and <span>RcppArmadillo</span>. Examples with real datasets are included, showcasing the application of various tests and providing insights into their practical utility.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108269"},"PeriodicalIF":1.6,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145010731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-dimensional subgroup functional quantile regression with panel and dependent data","authors":"Xiao-Ge Yu, Han-Ying Liang","doi":"10.1016/j.csda.2025.108268","DOIUrl":"10.1016/j.csda.2025.108268","url":null,"abstract":"<div><div>High-dimensional additive functional partial linear single-index quantile regression with high-dimensional parameters under subgroup panel data is investigated. Based on spline-based approach, we construct oracle estimators of the unknown parameter and functions, and discuss their consistency with rates and asymptotic normality under <span><math><mi>α</mi></math></span>-mixing assumptions. A penalized estimation method by using the SCAD technique is introduced to estimate the additive functions and parameter, enabling variable selection and automatic identification of the number of groups. Hypothesis testing for the parameter is also considered, and the asymptotic distributions of the restricted estimators and the test statistic are derived under both the null and local alternative hypotheses. Simulation studies and real data analysis are conducted to verify the validity of the proposed methods and applications.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"214 ","pages":"Article 108268"},"PeriodicalIF":1.6,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144934291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering causal structures in corrupted data: frugality in anchored Gaussian DAG models","authors":"Joonho Shin , Junhyoung Chung , Seyong Hwang , Gunwoong Park","doi":"10.1016/j.csda.2025.108267","DOIUrl":"10.1016/j.csda.2025.108267","url":null,"abstract":"<div><div>This study focuses on the recovery of anchored Gaussian directed acyclic graphical (DAG) models to address the challenge of discovering causal or directed relationships among variables in datasets that are either intentionally masked or contaminated due to measurement errors. A main contribution is to relax the existing restrictive identifiability conditions for anchored Gaussian DAG models by introducing the anchored-frugality assumption. This assumption posits that the true graph is the most frugal among those satisfying the possible distributions of the latent and observed variables, thereby making the true Markov equivalent class (MEC) identifiable. The validity of the anchored-frugality assumption is justified using both graph and probability theories, respectively. Another main contribution is the development of the anchored-SP and frugal-PC algorithms. Specifically, the anchored-SP algorithm finds the most frugal graph among all possible graphs satisfying the Markov condition while the frugal-PC algorithm finds the most frugal graph among some graphs. Hence, the frugal-PC algorithm is more computationally feasible, while it requires an additional frugality-faithfulness assumption for soundness. Various simulations support the theoretical findings of this study and demonstrate the practical effectiveness of the proposed algorithm against state-of-the-art algorithms such as ACI, PC, and MMHC. Furthermore, the applications of the proposed algorithm to protein signaling data and breast cancer data illustrate its effectiveness in uncovering relationships among proteins and among cancer-related cell nuclei characteristics.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"213 ","pages":"Article 108267"},"PeriodicalIF":1.6,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144908977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empirical likelihood based Bayesian variable selection","authors":"Yichen Cheng , Yichuan Zhao","doi":"10.1016/j.csda.2025.108258","DOIUrl":"10.1016/j.csda.2025.108258","url":null,"abstract":"<div><div>Empirical likelihood is a popular nonparametric statistical tool that does not require any distributional assumptions. The possibility of conducting variable selection via Bayesian empirical likelihood is studied both theoretically and empirically. Theoretically, it is shown that when the prior distribution satisfies certain mild conditions, the corresponding Bayesian empirical likelihood estimators are posteriorly consistent and variable selection consistent. As special cases, the prior of Bayesian empirical likelihood LASSO and SCAD satisfy such conditions and thus can identify the non-zero elements of the parameters with probability approaching 1. In addition, it is easy to verify that those conditions are met for other widely used priors such as ridge, elastic net and adaptive LASSO. Empirical likelihood depends on a parameter that needs to be obtained by numerically solving a non-linear equation. Thus, there exists no conjugate prior for the posterior distribution, which causes the slow convergence of the MCMC sampling algorithm in some cases. To solve this problem, an approximation distribution is used as the proposal to enhance the acceptance rate and, therefore, facilitate faster computation. The computational results demonstrate quick convergence for the examples used in the paper. Both simulations and real data analyses are performed to illustrate the advantages of the proposed methods.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"213 ","pages":"Article 108258"},"PeriodicalIF":1.6,"publicationDate":"2025-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144893327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimation of semiparametric probit model based on case-cohort interval-censored failure time data","authors":"Mingyue Du, Ricong Zeng","doi":"10.1016/j.csda.2025.108266","DOIUrl":"10.1016/j.csda.2025.108266","url":null,"abstract":"<div><div>The estimation of semiparametric probit model is discussed for the situation where one observes interval-censored failure time data arising from case-cohort studies. The probit model has recently attracted some attention for regression analysis of failure time data partly due to the popularity of the normal distribution and its similarity to linear models. Although some methods have been developed in the literature for its estimation, it does not seem to exist an established approach for the situation of case-cohort interval-censored data. To address this, a pseudo-maximum likelihood method is proposed and furthermore, an EM algorithm is developed for its implementation. The resulting estimators of regression parameters are shown to be consistent and asymptotically follow the normal distribution. To assess the empirical performance of the proposed method, a simulation study is conducted and indicates that it works well in practical situations. In addition, it is applied to a set of real data arising from an AIDS clinical trial that motivated this study.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"213 ","pages":"Article 108266"},"PeriodicalIF":1.6,"publicationDate":"2025-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144886727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating a smooth covariance for functional data","authors":"Uche Mbaka , James Owen Ramsay , Michelle Carey","doi":"10.1016/j.csda.2025.108255","DOIUrl":"10.1016/j.csda.2025.108255","url":null,"abstract":"<div><div>Functional data analysis frequently involves estimating a smooth covariance function based on observed data. This estimation is essential for understanding interactions among functions and constitutes a fundamental aspect of numerous advanced methodologies, including functional principal component analysis. Two approaches for estimating smooth covariance functions in the presence of measurement errors are introduced. The first method employs a low-rank approximation of the covariance matrix, while the second ensures positive definiteness via a Cholesky decomposition. Both approaches employ the use of penalized regression to produce smooth covariance estimates and have been validated through comprehensive simulation studies. The practical application of these methods is demonstrated through the examination of average weekly milk yields in dairy cows as well as egg-laying patterns of Mediterranean fruit flies.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"213 ","pages":"Article 108255"},"PeriodicalIF":1.6,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144886725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variable selection in AUC-optimizing classification","authors":"Hyungwoo Kim , Seung Jun Shin","doi":"10.1016/j.csda.2025.108256","DOIUrl":"10.1016/j.csda.2025.108256","url":null,"abstract":"<div><div>Optimizing the receiver operating characteristic (ROC) curve is a popular way to evaluate a binary classifier under imbalanced scenarios frequently encountered in practice. A practical approach to constructing a linear binary classifier is presented by simultaneously optimizing the area under the ROC curve (AUC) and selecting informative variables in high dimensions. In particular, the smoothly clipped absolute deviation (SCAD) penalty is employed, and its oracle property is established, which enables the development of a consistent BIC-type information criterion that greatly facilitates the tuning procedure. Both simulated and real data analyses demonstrate the promising performance of the proposed method in terms of AUC optimization and variable selection.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"213 ","pages":"Article 108256"},"PeriodicalIF":1.6,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}