Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2024-12-10 | DOI: 10.1007/s11222-024-10537-y
Jacopo Di Iorio, Marzia A Cremona, Francesca Chiaromonte
{"title":"funBIalign: a hierachical algorithm for functional motif discovery based on mean squared residue scores.","authors":"Jacopo Di Iorio, Marzia A Cremona, Francesca Chiaromonte","doi":"10.1007/s11222-024-10537-y","DOIUrl":"10.1007/s11222-024-10537-y","url":null,"abstract":"<p><p>Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical \"shapes\" or \"patterns\" that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose <i>funBIalign</i> for their discovery and evaluation. Inspired by clustering and biclustering techniques, <i>funBIalign</i> is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. 
Moreover, we use <i>funBIalign</i> for discovering motifs in two real-data case studies; one on food price inflation and one on temperature changes.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11222-024-10537-y.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 1","pages":"11"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632007/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142819226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas
{"title":"Hidden Markov models for multivariate panel data","authors":"Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas","doi":"10.1007/s11222-024-10462-0","DOIUrl":"https://doi.org/10.1007/s11222-024-10462-0","url":null,"abstract":"<p>While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms because they are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the issues that arise in panel data. A modified expectation–maximization algorithm capable of handling missing not at random data and dropout is presented and used to perform model estimation.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"20 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerated failure time models with error-prone response and nonlinear covariates","authors":"Li-Pang Chen","doi":"10.1007/s11222-024-10491-9","DOIUrl":"https://doi.org/10.1007/s11222-024-10491-9","url":null,"abstract":"<p>As a specific application of survival analysis, one of the main interests in medical studies is to analyze patients’ survival time for a specific cancer. Typically, gene expressions are treated as covariates to characterize the survival time. In the framework of survival analysis, the accelerated failure time model in the parametric form is perhaps a common approach. However, gene expressions are possibly nonlinear and the survival time as well as censoring status are subject to measurement error. In this paper, we aim to tackle those complex features simultaneously. We first correct for measurement error in survival time and censoring status, and use them to develop a corrected Buckley–James estimator. After that, we use the boosting algorithm with the cubic spline estimation method to iteratively recover the nonlinear relationship between covariates and survival time. Theoretically, we justify the validity of the measurement error correction and estimation procedure. Numerical studies show that the proposed method improves the performance of estimation and is able to capture informative covariates. 
The methodology is primarily used to analyze the breast cancer data provided by the Netherlands Cancer Institute for research.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"19 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential model identification with reversible jump ensemble data assimilation method","authors":"Yue Huan, Hai Xiang Lin","doi":"10.1007/s11222-024-10499-1","DOIUrl":"https://doi.org/10.1007/s11222-024-10499-1","url":null,"abstract":"<p>In data assimilation (DA) schemes, the forms representing the processes in the evolution models are predetermined, except for some parameters to be estimated. In some applications, such as the contaminant solute transport model and the gas reservoir model, the modes in the equations within the evolution model cannot be predetermined from the outset and may change over time. We propose a sequential DA framework named the Reversible Jump Ensemble Filter (RJEnF) to identify the governing modes of the evolution model over time. The main idea is to introduce the Reversible Jump Markov Chain Monte Carlo (RJMCMC) method into DA schemes to handle situations where the modes of the evolution model are unknown and the dimension of the parameters is changing. Our framework allows us to identify the modes in the evolution model and their changes, as well as estimate the parameters and states of the dynamic system. Numerical experiments are conducted and the results show that our framework can effectively identify the underlying evolution models and increase the predictive accuracy of DA methods.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"94 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shrinkage for extreme partial least-squares","authors":"Julyan Arbel, Stéphane Girard, Hadrien Lorenzo","doi":"10.1007/s11222-024-10490-w","DOIUrl":"https://doi.org/10.1007/s11222-024-10490-w","url":null,"abstract":"<p>This work focuses on dimension-reduction techniques for modelling conditional extreme values. Specifically, we investigate the idea that extreme values of a response variable can be explained by nonlinear functions derived from linear projections of an input random vector. In this context, the estimation of projection directions is examined, as approached by the extreme partial least squares (EPLS) method—an adaptation of the original partial least squares (PLS) method tailored to the extreme-value framework. Further, a novel interpretation of EPLS directions as maximum likelihood estimators is introduced, utilizing the von Mises–Fisher distribution applied to hyperballs. The dimension reduction process is enhanced through the Bayesian paradigm, enabling the incorporation of prior information into the projection direction estimation. The maximum a posteriori estimator is derived in two specific cases, elucidating it as a regularization or shrinkage of the EPLS estimator. We also establish its asymptotic behavior as the sample size approaches infinity. A simulation data study is conducted in order to assess the practical utility of our proposed method. This clearly demonstrates its effectiveness even in moderate data problems within high-dimensional settings. 
Furthermore, we provide an illustrative example of the method’s applicability using French farm income data, highlighting its efficacy in real-world scenarios.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"205 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonconvex Dantzig selector and its parallel computing algorithm","authors":"Jiawei Wen, Songshan Yang, Delin Zhao","doi":"10.1007/s11222-024-10492-8","DOIUrl":"https://doi.org/10.1007/s11222-024-10492-8","url":null,"abstract":"<p>The Dantzig selector is a popular <span>\\(\\ell_1\\)</span>-type variable selection method widely used across various research fields. However, <span>\\(\\ell_1\\)</span>-type methods may not perform well for variable selection without complex irrepresentable conditions. In this article, we introduce a nonconvex Dantzig selector for ultrahigh-dimensional linear models. We begin by demonstrating that the oracle estimator serves as a local optimum for the nonconvex Dantzig selector. In addition, we propose a one-step local linear approximation estimator, called the Dantzig-LLA estimator, for the nonconvex Dantzig selector, and establish its strong oracle property. The proposed regularization method avoids the restrictive conditions imposed by <span>\\(\\ell_1\\)</span> regularization methods to guarantee model selection consistency. Furthermore, we propose an efficient and parallelizable computing algorithm based on feature-splitting to address the computational challenges associated with the nonconvex Dantzig selector in high-dimensional settings. A comprehensive numerical study is conducted to evaluate the performance of the nonconvex Dantzig selector and the computing efficiency of the feature-splitting algorithm. The results demonstrate that the Dantzig selector with nonconvex penalty outperforms the <span>\\(\\ell_1\\)</span> penalty-based selector, and the feature-splitting algorithm performs well in high-dimensional settings where a linear programming solver may fail. 
Finally, we generalize the concept of nonconvex Dantzig selector to deal with more general loss functions.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"1 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust singular value decomposition with application to video surveillance background modelling","authors":"Subhrajyoty Roy, Abhik Ghosh, Ayanendranath Basu","doi":"10.1007/s11222-024-10493-7","DOIUrl":"https://doi.org/10.1007/s11222-024-10493-7","url":null,"abstract":"<p>The traditional method of computing singular value decomposition (SVD) of a data matrix is based on the least squares principle and is, therefore, very sensitive to the presence of outliers. Hence, the resulting inferences across different applications using the classical SVD are extremely degraded in the presence of data contamination. In particular, background modelling of video surveillance data in the presence of camera tampering cannot be reliably solved by the classical SVD. In this paper, we propose a novel robust singular value decomposition technique based on the popular minimum density power divergence estimator. We have established the theoretical properties of the proposed estimator such as convergence, equivariance and consistency under the high-dimensional regime where both the row and column dimensions of the data matrix approach infinity. We also propose a fast and scalable algorithm based on alternating weighted regression to obtain the estimate. Within the scope of our fairly extensive simulation studies, our method performs better than existing robust SVD algorithms. 
Finally, we present an application of the proposed method on the video surveillance background modelling problem.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"39 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal confidence interval for the difference between proportions","authors":"Almog Peer, David Azriel","doi":"10.1007/s11222-024-10485-7","DOIUrl":"https://doi.org/10.1007/s11222-024-10485-7","url":null,"abstract":"<p>Estimating the probability of the binomial distribution is a basic problem, which appears in almost all introductory statistics courses and is performed frequently in various studies. In some cases, the parameter of interest is a difference between two probabilities, and the current work studies the construction of confidence intervals for this parameter when the sample size is small. Our goal is to find the shortest confidence intervals under the constraint of coverage probability being at least as large as a predetermined level. For the two-sample case, there is no known algorithm that achieves this goal, but different heuristic procedures have been suggested, and the present work aims at finding optimal confidence intervals. In the one-sample case, there is a known algorithm that finds optimal confidence intervals presented by Blyth and Still (J Am Stat Assoc 78(381):108–116, 1983). It is based on solving small and local optimization problems and then using an inversion step to find the global optimum solution. We show that this approach fails in the two-sample case and therefore, in order to find optimal confidence intervals, one needs to solve a global optimization problem, rather than small and local ones, which is computationally much harder. We present and discuss the suitable global optimization problem. Using the Gurobi package we find near-optimal solutions when the sample sizes are smaller than 15, and we compare these solutions to some existing methods, both approximate and exact. We find that the improvement in terms of lengths with respect to the best competitor varies between 1.5 and 5% for different parameters of the problem. 
Therefore, we recommend the use of the new confidence intervals when both sample sizes are smaller than 15. Tables of the confidence intervals are given in the Excel file in this link (https://technionmail-my.sharepoint.com/:f:/g/personal/ap_campus_technion_ac_il/El-213Kms51BhQxR8MmQJCYBDfIsvtrK9mQIey1sZnZWIQ?e=hxGunl).</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"9 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huiling Liu, Xinmin Li, Feifei Chen, Wolfgang Härdle, Hua Liang
{"title":"A comprehensive comparison of goodness-of-fit tests for logistic regression models","authors":"Huiling Liu, Xinmin Li, Feifei Chen, Wolfgang Härdle, Hua Liang","doi":"10.1007/s11222-024-10487-5","DOIUrl":"https://doi.org/10.1007/s11222-024-10487-5","url":null,"abstract":"<p>We introduce a projection-based test for assessing logistic regression models using the empirical residual marked empirical process and suggest a model-based bootstrap procedure to calculate critical values. We comprehensively compare this test and Stute and Zhu’s test with several commonly used goodness-of-fit (GoF) tests: the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, Osius–Rojek test, and Stukel test for logistic regression models in terms of type I error control and power performance in small (<span>\\(n=50\\)</span>), moderate (<span>\\(n=100\\)</span>), and large (<span>\\(n=500\\)</span>) sample sizes. We assess the power performance for two commonly encountered situations: nonlinear and interaction departures from the null hypothesis. All tests except the modified Hosmer–Lemeshow test and Osius–Rojek test have the correct size in all sample sizes. The projection-based test consistently outperforms its competitors in power. We apply these tests to analyze an AIDS dataset and a cancer dataset. For the former, all tests except the projection-based test do not reject a simple linear function in the logit, which has been illustrated to be deficient in the literature. 
For the latter dataset, the Hosmer–Lemeshow test, modified Hosmer–Lemeshow test, and Osius–Rojek test fail to detect the quadratic form in the logit, which was detected by the Stukel test, Stute and Zhu’s test, and the projection-based test.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"4 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New forest-based approaches for sufficient dimension reduction","authors":"Shuang Dai, Ping Wu, Zhou Yu","doi":"10.1007/s11222-024-10482-w","DOIUrl":"https://doi.org/10.1007/s11222-024-10482-w","url":null,"abstract":"<p>Sufficient dimension reduction (SDR) primarily aims to reduce the dimensionality of high-dimensional predictor variables while retaining essential information about the responses. Traditional SDR methods typically employ kernel weighting functions, which unfortunately makes them susceptible to the curse of dimensionality. To address this issue, in this paper we propose novel forest-based approaches for SDR that utilize a locally adaptive kernel generated by Mondrian forests. Overall, our work takes the perspective of Mondrian forest as an adaptive weighted kernel technique for SDR problems. In the central mean subspace model, by integrating the methods from Xia et al. (J R Stat Soc Ser B (Stat Methodol) 64(3):363–410, 2002. https://doi.org/10.1111/1467-9868.03411) with Mondrian forest weights, we suggest the forest-based outer product of gradients estimation (mf-OPG) and the forest-based minimum average variance estimation (mf-MAVE). Moreover, we substitute the kernels used in nonparametric density function estimations (Xia in Ann Stat 35(6):2654–2690, 2007. https://doi.org/10.1214/009053607000000352), targeting the central subspace, with Mondrian forest weights. These techniques are referred to as mf-dOPG and mf-dMAVE, respectively. Under regularity conditions, we establish the asymptotic properties of our forest-based estimators, as well as the convergence of the affiliated algorithms. 
Through simulation studies and analysis of fully observable data, we demonstrate substantial improvements in computational efficiency and predictive accuracy of our proposals compared with the traditional counterparts.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"57 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}