{"title":"Statistical inference for partially shape-constrained function-on-scalar linear regression models","authors":"Kyunghee Han , Yeonjoo Park , Soo-Young Kim","doi":"10.1016/j.csda.2025.108200","DOIUrl":"10.1016/j.csda.2025.108200","url":null,"abstract":"<div><div>Functional linear regression models are widely used to link functional/longitudinal outcomes with multiple scalar predictors, identifying time-varying covariate effects through regression coefficient functions. Beyond assessing statistical significance, characterizing the shapes of coefficient functions is crucial for drawing interpretable scientific conclusions. Existing studies on shape-constrained analysis primarily focus on global shapes, which require strict prior knowledge of functional relationships across the entire domain. This often leads to misspecified regression models due to a lack of prior information, making them impractical for real-world applications. To address this, a flexible framework is introduced to identify partial shapes in regression coefficient functions. The proposed partial shape-constrained analysis enables researchers to validate functional shapes within a targeted sub-domain, avoiding the misspecification of shape constraints outside the sub-domain of interest. The method also allows for testing different sub-domains for individual covariates and multiple partial shape constraints across composite sub-domains. Our framework supports both kernel- and spline-based estimation approaches, ensuring robust performance with flexibility in computational preference. Finite-sample experiments across various scenarios demonstrate that the proposed framework significantly outperforms the application of global shape constraints to partial domains in both estimation and inference procedures. 
The inferential tool particularly maintains the type I error rate at the nominal significance level and exhibits increasing power with larger sample sizes, confirming the consistency of the test procedure. The practicality of partial shape-constrained inference is demonstrated through two applications: a clinical trial on NeuroBloc for type A-resistant cervical dystonia and the National Institute of Mental Health Schizophrenia Study.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"211 ","pages":"Article 108200"},"PeriodicalIF":1.5,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144083910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed variable screening for generalized linear models","authors":"Tianbo Diao , Bo Li , Lianqiang Qu , Liuquan Sun","doi":"10.1016/j.csda.2025.108203","DOIUrl":"10.1016/j.csda.2025.108203","url":null,"abstract":"<div><div>In this article, we develop a distributed variable screening method for generalized linear models. This method is designed to handle situations where both the sample size and the number of covariates are large. Specifically, the proposed method selects relevant covariates by using a sparsity-restricted surrogate likelihood estimator. It takes into account the joint effects of the covariates rather than just the marginal effect, and this characteristic enhances the reliability of the screening results. We establish the sure screening property of the proposed method, which ensures that with a high probability, the true model is included in the selected model. Simulation studies are conducted to evaluate the finite sample performance of the proposed method, and an application to a real dataset showcases its practical utility.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"211 ","pages":"Article 108203"},"PeriodicalIF":1.5,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantile Super Learning for independent and online settings with application to solar power forecasting","authors":"Herbert Susmann , Antoine Chambaz","doi":"10.1016/j.csda.2025.108202","DOIUrl":"10.1016/j.csda.2025.108202","url":null,"abstract":"<div><div>Estimating quantiles of an outcome conditional on covariates is of fundamental interest in statistics with broad application in probabilistic prediction and forecasting. An ensemble method for conditional quantile estimation is proposed, Quantile Super Learning, that combines predictions from multiple candidate algorithms based on their empirical performance measured with respect to a cross-validated empirical risk of the quantile loss function. Theoretical guarantees for both i.i.d. and online data scenarios are presented. The performance of <em>this</em> approach for quantile estimation and in forming prediction intervals is tested in simulation studies. Two case studies related to solar energy are used to illustrate Quantile Super Learning: in an i.i.d. setting, we predict the physical properties of perovskite materials for photovoltaic cells, and in an online setting we forecast ground solar irradiance based on output from dynamic weather ensemble models.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"211 ","pages":"Article 108202"},"PeriodicalIF":1.5,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monotone composite quantile regression neural network for censored data with a cure fraction","authors":"Xinran Zhang , Xiaohui Yuan , Chunjie Wang , Xinyuan Song","doi":"10.1016/j.csda.2025.108201","DOIUrl":"10.1016/j.csda.2025.108201","url":null,"abstract":"<div><div>The cure rate monotone composite quantile regression neural network model is investigated as an extension of the cure rate quantile model. It can uncover complex nonlinear relationships and effectively ensure the non-crossing of quantile predictions. An iterative algorithm coupled with data augmentation is developed to predict the survival time of susceptible subjects and the cure rate among all subjects. Simulation studies indicate that the proposed approach exhibits advantages in prediction over traditional statistical methods in finite samples when nonlinearity exists between response and predictors. The analysis of two real datasets further validates the utility of the proposed method.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"211 ","pages":"Article 108201"},"PeriodicalIF":1.5,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143935576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent-class trajectory modeling with a heterogeneous mean-variance relation","authors":"Niek G.P. Den Teuling , Francesco Ungolo , Steffen C. Pauws , Edwin R. van den Heuvel","doi":"10.1016/j.csda.2025.108199","DOIUrl":"10.1016/j.csda.2025.108199","url":null,"abstract":"<div><div>The benefit of addressing heteroskedastic residual variances across trajectories is investigated with the purpose of finding clusters of longitudinal trajectories. Models are proposed to account for class-specific heteroskedasticity through a mean-variance relation or random residual variance, thereby accounting for trajectory-specific variance. The analyzed latent-class trajectory models are an extension of growth mixture models (GMM). The estimation bias of the model parameters and the recoverability of the number of latent classes are assessed under various data-generating models and settings by means of a simulation study. Furthermore, the empirical applicability of these models is demonstrated through the analysis of the time-varying incidence rate of COVID-19 cases across counties in the United States. Overall, the class-specific mean-variance could be reliably estimated by the proposed models in datasets comprising 250 trajectories. 
In addition, the extended GMM accounting for the residual random variance showed improved group trajectory estimation over the standard GMM.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108199"},"PeriodicalIF":1.5,"publicationDate":"2025-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143904339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A goodness-of-fit test for geometric Brownian motion","authors":"Daniel Gaigall , Philipp Wübbolding","doi":"10.1016/j.csda.2025.108196","DOIUrl":"10.1016/j.csda.2025.108196","url":null,"abstract":"<div><div>A new goodness-of-fit test for the composite null hypothesis that data originate from a geometric Brownian motion is studied in the functional data setting. This is equivalent to testing if the data are from a scaled Brownian motion with linear drift. Critical values for the test are obtained, ensuring that the specified significance level is achieved in finite samples. The asymptotic behavior of the test statistic under the null distribution and alternatives is studied, and it is also demonstrated that the test is consistent. Furthermore, the proposed approach offers advantages in terms of fast and simple implementation. A comprehensive simulation study shows that the power of the new test compares favorably to that of existing methods. A key application is the assessment of financial time series for the suitability of the Black-Scholes model. Examples relating to various stock and interest rate time series are presented in order to illustrate the proposed test.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108196"},"PeriodicalIF":1.5,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143868689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A simultaneous confidence-bounded true discovery proportion perspective on localizing differences in smooth terms in regression models","authors":"David Swanson","doi":"10.1016/j.csda.2025.108197","DOIUrl":"10.1016/j.csda.2025.108197","url":null,"abstract":"<div><div>A method is demonstrated for localizing where two spline terms, or smooths, differ using a true discovery proportion (TDP)-based interpretation. The procedure yields a statement on the proportion of some region where true differences exist between two smooths. The methodology avoids ad hoc approaches to making such statements, like subsetting the data and performing hypothesis tests on the truncated spline terms. TDP estimates are 1-<em>α</em> confidence-bounded simultaneously, which means that a region's TDP estimate is a lower bound on the proportion of actual differences, or true discoveries, in that region, with high confidence regardless of the number of estimates made. The procedure is based on closed-testing using Simes local test. This local test requires that the multivariate <span><math><msup><mrow><mi>χ</mi></mrow><mrow><mn>2</mn></mrow></msup></math></span> test statistics of generalized Wishart type underlying the method be positive regression dependent on subsets (PRDS), a result for which evidence is presented suggesting that the condition holds. 
Consistency of the procedure is demonstrated for generalized additive models with the tuning parameter chosen by REML or GCV, and the achievement of confidence-bounded TDP is shown in simulation as is an analysis of walking gait.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"211 ","pages":"Article 108197"},"PeriodicalIF":1.5,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143906892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Co-clustering multi-view data using the Latent Block Model","authors":"Joshua Tobin , Michaela Black , James Ng , Debbie Rankin , Jonathan Wallace , Catherine Hughes , Leane Hoey , Adrian Moore , Jinling Wang , Geraldine Horigan , Paul Carlin , Helene McNulty , Anne M. Molloy , Mimi Zhang","doi":"10.1016/j.csda.2025.108188","DOIUrl":"10.1016/j.csda.2025.108188","url":null,"abstract":"<div><div>The Latent Block Model (LBM) is a prominent model-based co-clustering method, returning parametric representations of each block-cluster and allowing the use of well-grounded model selection methods. Although the LBM has been adapted to accommodate various feature types, it cannot be applied to datasets consisting of multiple distinct sets of features, termed views, for a common set of observations. The multi-view LBM is introduced herein, extending the LBM method to multi-view data, where each view marginally follows an LBM. For any pair of two views, the dependence between them is captured by a row-cluster membership matrix. A likelihood-based approach is formulated for parameter estimation, harnessing a stochastic EM algorithm merged with a Gibbs sampler, while an ICL criterion is formulated to determine the number of row- and column-clusters in each view. To justify the application of the multi-view approach, hypothesis tests are formulated to evaluate the independence of row-clusters across views, with the testing procedure seamlessly integrated into the estimation framework. A penalty scheme is also introduced to induce sparsity in row-clusterings. The algorithm's performance is validated using synthetic and real-world datasets, accompanied by recommendations for optimal parameter selection. 
Finally, the multi-view co-clustering method is applied to a complex genomics dataset, and is shown to provide new insights for high-dimension multi-view problems.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108188"},"PeriodicalIF":1.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143868688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-parametric tests for cross-dependence based on multivariate extensions of ordinal patterns","authors":"Angelika Silbernagel , Christian H. Weiß , Alexander Schnurr","doi":"10.1016/j.csda.2025.108189","DOIUrl":"10.1016/j.csda.2025.108189","url":null,"abstract":"<div><div>Analyzing the cross-dependence within sequentially observed pairs of random variables is an interesting mathematical problem that also has several practical applications. Most of the time, classical dependence measures like Pearson's correlation are used to this end. This quantity, however, only measures linear dependence and has other drawbacks as well. Different concepts for measuring cross-dependence in sequentially observed random vectors, which are based on so-called ordinal patterns or multivariate generalizations of them, are described. In all cases, limiting distributions of the corresponding test statistics are derived. In a simulation study, the performance of these statistics is compared with three competitors, namely, classical Pearson's and Spearman's correlation as well as the rank-based Chatterjee's correlation coefficient. The applicability of the test statistics is illustrated by using them on two real-world data examples.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108189"},"PeriodicalIF":1.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A flexible mixed-membership model for community and enterotype detection for microbiome data","authors":"Alice Giampino, Roberto Ascari, Sonia Migliorati","doi":"10.1016/j.csda.2025.108181","DOIUrl":"10.1016/j.csda.2025.108181","url":null,"abstract":"<div><div>Understanding how the human gut microbiome affects host health is challenging due to the wide interindividual variability, sparsity, and high dimensionality of microbiome data. Mixed-membership models have been previously applied to these data to detect latent communities of bacterial taxa that are expected to co-occur. The most widely used mixed-membership model is latent Dirichlet allocation (LDA). However, LDA is limited by the rigidity of the Dirichlet distribution imposed on the community proportions, which hinders its ability to model dependencies and account for overdispersion. To address this limitation, a generalization of LDA is proposed that introduces greater flexibility into the covariance matrix by incorporating the flexible Dirichlet (FD), a specific identifiable mixture with Dirichlet components. In addition to identifying communities, the new model enables the detection of enterotypes, i.e., clusters of samples with similar microbe composition. For inferential purposes, a computationally efficient collapsed Gibbs sampler that exploits the conjugacy of the FD distribution with respect to the multinomial model is proposed. A simulation study demonstrates the model's ability to accurately recover true parameter values by minimizing appropriate compositional discrepancy measures between the true and estimated values. Additionally, the model correctly identifies the number of communities, as evidenced by perplexity scores. Moreover, an application to the COMBO dataset highlights its effectiveness in detecting biologically significant and coherent communities and enterotypes, revealing a broader range of correlations between community abundances. 
These results underscore the new model as a definite improvement over LDA.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108181"},"PeriodicalIF":1.5,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}