Statistics and Computing | Pub Date: 2025-02-01 | Epub Date: 2024-11-16 | DOI: 10.1007/s11222-024-10526-1
Qiang Heng, Kenneth Lange
{"title":"Bootstrap estimation of the proportion of outliers in robust regression.","authors":"Qiang Heng, Kenneth Lange","doi":"10.1007/s11222-024-10526-1","DOIUrl":"https://doi.org/10.1007/s11222-024-10526-1","url":null,"abstract":"<p><p>This paper presents a nonparametric bootstrap method for estimating the proportions of inliers and outliers in robust regression models. Our approach is based on the concept of stability, providing robustness against distributional assumptions and eliminating the need for pre-specified confidence levels. Through numerical experiments, we demonstrate that this method yields more accurate and stable estimates than existing alternatives. Additionally, the generated instability paths offer a valuable graphical tool for understanding the inlier and outlier distributions within the data. The method naturally extends to generalized linear models, where we find that variance-stabilizing transformations produce residuals that are well-suited for outlier detection. Applications to two real-world datasets further illustrate the practical utility of our approach in identifying outliers.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12077844/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144080117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2025-02-25 | DOI: 10.1007/s11222-025-10584-z
Lorenzo Rimella, Chris Jewell, Paul Fearnhead
{"title":"Simulation based composite likelihood.","authors":"Lorenzo Rimella, Chris Jewell, Paul Fearnhead","doi":"10.1007/s11222-025-10584-z","DOIUrl":"10.1007/s11222-025-10584-z","url":null,"abstract":"<p><p>Inference for high-dimensional hidden Markov models is challenging due to the exponential-in-dimension computational cost of calculating the likelihood. To address this issue, we introduce an innovative composite likelihood approach called \"Simulation Based Composite Likelihood\" (SimBa-CL). With SimBa-CL, we approximate the likelihood by the product of its marginals, which we estimate using Monte Carlo sampling. In a similar vein to approximate Bayesian computation (ABC), SimBa-CL requires multiple simulations from the model, but, in contrast to ABC, it provides a likelihood approximation that guides the optimization of the parameters. Leveraging automatic differentiation libraries, it is simple to calculate gradients and Hessians to not only speed up optimization but also to build approximate confidence sets. We present extensive empirical results which validate our theory and demonstrate its advantage over SMC, and apply SimBa-CL to real-world Aphtovirus data.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11222-025-10584-z.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 3","pages":"58"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11861035/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143524490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2025-05-17 | DOI: 10.1007/s11222-025-10624-8
Sehwan Kim, Faming Liang
{"title":"Extended fiducial inference for individual treatment effects via deep neural networks.","authors":"Sehwan Kim, Faming Liang","doi":"10.1007/s11222-025-10624-8","DOIUrl":"10.1007/s11222-025-10624-8","url":null,"abstract":"<p><p>Individual treatment effect estimation has gained significant attention in recent data science literature. This work introduces the Double Neural Network (Double-NN) method to address this problem within the framework of extended fiducial inference (EFI). In the proposed method, deep neural networks are used to model the treatment and control effect functions, while an additional neural network is employed to estimate their parameters. The universal approximation capability of deep neural networks ensures the broad applicability of this method. Numerical results highlight the superior performance of the proposed Double-NN method compared to the conformal quantile regression (CQR) method in individual treatment effect estimation. From the perspective of statistical inference, this work advances the theory and methodology for statistical inference of large models. Specifically, it is theoretically proven that the proposed method permits the model size to increase with the sample size <i>n</i> at a rate of <math><mrow><mi>O</mi> <mo>(</mo> <msup><mi>n</mi> <mi>ζ</mi></msup> <mo>)</mo></mrow> </math> for some <math><mrow><mn>0</mn> <mo>≤</mo> <mi>ζ</mi> <mo><</mo> <mn>1</mn></mrow> </math> , while still maintaining proper quantification of uncertainty in the model parameters. This result marks a significant improvement compared to the range <math><mrow><mn>0</mn> <mo>≤</mo> <mi>ζ</mi> <mo><</mo> <mfrac><mn>1</mn> <mn>2</mn></mfrac> </mrow> </math> required by the classical central limit theorem. 
Furthermore, this work provides a rigorous framework for quantifying the uncertainty of deep neural networks under the neural scaling law, representing a substantial contribution to the statistical understanding of large-scale neural network models.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11222-025-10624-8.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 4","pages":"97"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12085359/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144102739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2025-04-03 | DOI: 10.1007/s11222-025-10606-w
Joshua Corneck, Edward A K Cohen, James S Martin, Francesco Sanna Passino
{"title":"Online Bayesian changepoint detection for network Poisson processes with community structure.","authors":"Joshua Corneck, Edward A K Cohen, James S Martin, Francesco Sanna Passino","doi":"10.1007/s11222-025-10606-w","DOIUrl":"10.1007/s11222-025-10606-w","url":null,"abstract":"<p><p>Network point processes often exhibit latent structure that governs the behaviour of the sub-processes. It is not always reasonable to assume that this latent structure is static, and detecting when and how this driving structure changes is often of interest. In this paper, we introduce a novel online methodology for detecting changes within the latent structure of a network point process. We focus on block-homogeneous Poisson processes, where latent node memberships determine the rates of the edge processes. We propose a scalable variational procedure which can be applied on large networks in an online fashion via a Bayesian forgetting factor applied to sequential variational approximations to the posterior distribution. The proposed framework is tested on simulated and real-world data, and it rapidly and accurately detects changes to the latent edge process rates, and to the latent node group memberships, both in an online manner. 
In particular, in an application on the Santander Cycles bike-sharing network in central London, we detect changes within the network related to holiday periods and lockdown restrictions between 2019 and 2020.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 3","pages":"75"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11968509/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143796525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2025-02-20 | DOI: 10.1007/s11222-025-10582-1
Timofei Biziaev, Karen Kopciuk, Thierry Chekouo
{"title":"Using prior-data conflict to tune Bayesian regularized regression models.","authors":"Timofei Biziaev, Karen Kopciuk, Thierry Chekouo","doi":"10.1007/s11222-025-10582-1","DOIUrl":"10.1007/s11222-025-10582-1","url":null,"abstract":"<p><p>In high-dimensional regression models, variable selection becomes challenging from a computational and theoretical perspective. Bayesian regularized regression via shrinkage priors like the Laplace or spike-and-slab prior are effective methods for variable selection in <math><mrow><mi>p</mi> <mo>></mo> <mi>n</mi></mrow> </math> scenarios provided the shrinkage priors are configured adequately. We propose an empirical Bayes configuration using checks for prior-data conflict: tests that assess whether there is disagreement in parameter information provided by the prior and data. We apply our proposed method to the Bayesian LASSO and spike-and-slab shrinkage priors in the linear regression model and assess the variable selection performance of our prior configurations through a high-dimensional simulation study. Additionally, we apply our method to proteomic data collected from patients admitted to the Albany Medical Center in Albany NY in April of 2020 with COVID-like respiratory issues. 
Simulation results suggest our proposed configurations may outperform competing models when the true regression effects are small.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11222-025-10582-1.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 2","pages":"53"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11842445/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143484027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2025-03-16 | DOI: 10.1007/s11222-025-10600-2
Joseph Rilling, Cheng Yong Tang
{"title":"A new <i>p</i>-value based multiple testing procedure for generalized linear models.","authors":"Joseph Rilling, Cheng Yong Tang","doi":"10.1007/s11222-025-10600-2","DOIUrl":"10.1007/s11222-025-10600-2","url":null,"abstract":"<p><p>This study introduces a novel <i>p</i>-value-based multiple testing approach tailored for generalized linear models. Despite the crucial role of generalized linear models in statistics, existing methodologies face obstacles arising from the heterogeneous variance of response variables and complex dependencies among estimated parameters. Our aim is to address the challenge of controlling the false discovery rate (FDR) amidst arbitrarily dependent test statistics. Through the development of efficient computational algorithms, we present a versatile statistical framework for multiple testing. The proposed framework accommodates a range of tools developed for constructing a new model matrix in regression-type analysis, including random row permutations and Model-X knockoffs. We devise efficient computing techniques to solve the encountered non-trivial quadratic matrix equations, enabling the construction of paired <i>p</i>-values suitable for the two-step multiple testing procedure proposed by Sarkar and Tang (Biometrika 109(4): 1149-1155, 2022). Theoretical analysis affirms the properties of our approach, demonstrating its capability to control the FDR at a given level. 
Empirical evaluations further substantiate its promising performance across diverse simulation settings.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11222-025-10600-2.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 3","pages":"69"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11911269/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143658683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimation and model selection for finite mixtures of Tukey's <i>g</i>- &-<i>h</i> distributions.","authors":"Tingting Zhan, Misung Yi, Amy R Peck, Hallgeir Rui, Inna Chervoneva","doi":"10.1007/s11222-025-10596-9","DOIUrl":"10.1007/s11222-025-10596-9","url":null,"abstract":"<p><p>A finite mixture of distributions is a popular statistical model, which is especially meaningful when the population of interest may include distinct subpopulations. This work is motivated by analysis of protein expression levels quantified using immunofluorescence immunohistochemistry assays of human tissues. The distributions of cellular protein expression levels in a tissue often exhibit multimodality, skewness and heavy tails, but there is a substantial variability between distributions in different tissues from different subjects, while some of these mixture distributions include components consistent with the assumption of a normal distribution. To accommodate such diversity, we propose a mixture of 4-parameter Tukey's <i>g</i>- &-<i>h</i> distributions for fitting finite mixtures with both Gaussian and non-Gaussian components. Tukey's <i>g</i>- &-<i>h</i> distribution is a flexible model that allows variable degree of skewness and kurtosis in mixture components, including normal distribution as a particular case. Since the likelihood of the Tukey's <i>g</i>- &-<i>h</i> mixtures does not have a closed analytical form, we propose a quantile least Mahalanobis distance (QLMD) estimator for parameters of such mixtures. QLMD is an indirect estimator minimizing the Mahalanobis distance between the sample and model-based quantiles, and its asymptotic properties follow from the general theory of indirect estimation. We have developed a stepwise algorithm to select a parsimonious Tukey's <i>g</i>- &-<i>h</i> mixture model and implemented all proposed methods in the R package QuantileGH available on CRAN. 
A simulation study was conducted to evaluate performance of the Tukey's <i>g</i>- &-<i>h</i> mixtures and compare to performance of mixtures of skew-normal or skew-<i>t</i> distributions. The Tukey's <i>g</i>- &-<i>h</i> mixtures were applied to model cellular expressions of Cyclin D1 protein in breast cancer tissues, and resulting parameter estimates evaluated as predictors of progression-free survival.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 3","pages":"67"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11910465/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143650810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistics and Computing | Pub Date: 2025-01-01 | Epub Date: 2024-12-10 | DOI: 10.1007/s11222-024-10537-y
Jacopo Di Iorio, Marzia A Cremona, Francesca Chiaromonte
{"title":"funBIalign: a hierachical algorithm for functional motif discovery based on mean squared residue scores.","authors":"Jacopo Di Iorio, Marzia A Cremona, Francesca Chiaromonte","doi":"10.1007/s11222-024-10537-y","DOIUrl":"10.1007/s11222-024-10537-y","url":null,"abstract":"<p><p>Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical \"shapes\" or \"patterns\" that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose <i>funBIalign</i> for their discovery and evaluation. Inspired by clustering and biclustering techniques, <i>funBIalign</i> is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. 
Moreover, we use <i>funBIalign</i> for discovering motifs in two real-data case studies; one on food price inflation and one on temperature changes.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s11222-024-10537-y.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"35 1","pages":"11"},"PeriodicalIF":1.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632007/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142819226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas
{"title":"Hidden Markov models for multivariate panel data","authors":"Mackenzie R. Neal, Alexa A. Sochaniwsky, Paul D. McNicholas","doi":"10.1007/s11222-024-10462-0","DOIUrl":"https://doi.org/10.1007/s11222-024-10462-0","url":null,"abstract":"<p>While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms because they are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the issues that arise in panel data. A modified expectation–maximization algorithm capable of handling missing not at random data and dropout is presented and used to perform model estimation.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"20 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerated failure time models with error-prone response and nonlinear covariates","authors":"Li-Pang Chen","doi":"10.1007/s11222-024-10491-9","DOIUrl":"https://doi.org/10.1007/s11222-024-10491-9","url":null,"abstract":"<p>As a specific application of survival analysis, one of main interests in medical studies aims to analyze the patients’ survival time of a specific cancer. Typically, gene expressions are treated as covariates to characterize the survival time. In the framework of survival analysis, the accelerated failure time model in the parametric form is perhaps a common approach. However, gene expressions are possibly nonlinear and the survival time as well as censoring status are subject to measurement error. In this paper, we aim to tackle those complex features simultaneously. We first correct for measurement error in survival time and censoring status, and use them to develop a corrected Buckley–James estimator. After that, we use the boosting algorithm with the cubic spline estimation method to iteratively recover nonlinear relationship between covariates and survival time. Theoretically, we justify the validity of measurement error correction and estimation procedure. Numerical studies show that the proposed method improves the performance of estimation and is able to capture informative covariates. 
The methodology is primarily used to analyze the breast cancer data provided by the Netherlands Cancer Institute for research.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":"19 1","pages":""},"PeriodicalIF":2.2,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}