{"title":"RandALO: Out-of-sample risk estimation in no time flat","authors":"Parth T. Nobel, Daniel LeJeune, Emmanuel J. Candès","doi":"arxiv-2409.09781","DOIUrl":"https://doi.org/arxiv-2409.09781","url":null,"abstract":"Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias ($K$-fold CV) for computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO available on PyPI as randalo and at https://github.com/cvxgrp/randalo.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accuracy of the Ensemble Kalman Filter in the Near-Linear Setting","authors":"Edoardo Calvello, Pierre Monmarché, Andrew M. Stuart, Urbain Vaes","doi":"arxiv-2409.09800","DOIUrl":"https://doi.org/arxiv-2409.09800","url":null,"abstract":"The filtering distribution captures the statistics of the state of a dynamical system from partial and noisy observations. Classical particle filters provably approximate this distribution in quite general settings; however they behave poorly for high dimensional problems, suffering weight collapse. This issue is circumvented by the ensemble Kalman filter which is an equal-weight interacting particle system. However, this finite particle system is only proven to approximate the true filter in the linear Gaussian case. In practice, however, it is applied in much broader settings; as a result, establishing its approximation properties more generally is important. There has been recent progress in the theoretical analysis of the algorithm, establishing stability and error estimates in non-Gaussian settings, but the assumptions on the dynamics and observation models rule out the unbounded vector fields that arise in practice and the analysis applies only to the mean field limit of the ensemble Kalman filter. The present work establishes error bounds between the filtering distribution and the finite particle ensemble Kalman filter when the model exhibits linear growth.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asymptotics for irregularly observed long memory processes","authors":"Mohamedou Ould-Haye, Anne Philippe","doi":"arxiv-2409.09498","DOIUrl":"https://doi.org/arxiv-2409.09498","url":null,"abstract":"We study the effect of observing a stationary process at irregular time points via a renewal process. We establish a sharp difference in the asymptotic behaviour of the self-normalized sample mean of the observed process depending on the renewal process. In particular, we show that if the renewal process has a moderate heavy tail distribution then the limit is a so-called Normal Variance Mixture (NVM) and we characterize the randomized variance part of the limiting NVM as an integral function of a Lévy stable motion. Otherwise, the normalized sample mean will be asymptotically normal.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Statistical Viewpoint on Differential Privacy: Hypothesis Testing, Representation and Blackwell's Theorem","authors":"Weijie J. Su","doi":"arxiv-2409.09558","DOIUrl":"https://doi.org/arxiv-2409.09558","url":null,"abstract":"Differential privacy is widely considered the formal privacy for privacy-preserving data analysis due to its robust and rigorous guarantees, with increasingly broad adoption in public services, academia, and industry. Despite originating in the cryptographic context, in this review paper we argue that, fundamentally, differential privacy can be considered a \\textit{pure} statistical concept. By leveraging a theorem due to David Blackwell, our focus is to demonstrate that the definition of differential privacy can be formally motivated from a hypothesis testing perspective, thereby showing that hypothesis testing is not merely convenient but also the right language for reasoning about differential privacy. This insight leads to the definition of $f$-differential privacy, which extends other differential privacy definitions through a representation theorem. We review techniques that render $f$-differential privacy a unified framework for analyzing privacy bounds in data analysis and machine learning. Applications of this differential privacy definition to private deep learning, private convex optimization, shuffled mechanisms, and U.S. Census data are discussed to highlight the benefits of analyzing privacy bounds under this framework compared to existing alternatives.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions","authors":"Hanmei Sun, Jiangshan Zhang, Jiming Jiang","doi":"arxiv-2409.09355","DOIUrl":"https://doi.org/arxiv-2409.09355","url":null,"abstract":"Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. The traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. A situation of the latter scenario occurs when the candidate predictors involve many binary indicators corresponding to categories of some categorical predictors as well as their interactions. We propose an alternative approach to the shrinkage prediction methods in such a case based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish theoretical validity of the proposed method, and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bounding the probability of causality under ordinal outcomes","authors":"Hanmei Sun, Chengfeng Shi, Qiang Zhao","doi":"arxiv-2409.09297","DOIUrl":"https://doi.org/arxiv-2409.09297","url":null,"abstract":"The probability of causation (PC) is often used in liability assessments. In a legal context, for example, where a patient suffered the side effect after taking a medication and sued the pharmaceutical company as a result, the value of the PC can help assess the likelihood that the side effect was caused by the medication, in other words, how likely it is that the patient will win the case. Beyond the issue of legal disputes, the PC plays an equally large role when one wants to go about explaining causal relationships between events that have already occurred in other areas. This article begins by reviewing the definitions and bounds of the probability of causality for binary outcomes, then generalizes them to ordinal outcomes. It demonstrates that incorporating additional mediator variable information in a complete mediation analysis provides a more refined bound compared to the simpler scenario where only exposure and outcome variables are considered.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Asymptotics of Wide Remedians","authors":"Philip T. Labo","doi":"arxiv-2409.09528","DOIUrl":"https://doi.org/arxiv-2409.09528","url":null,"abstract":"The remedian uses a $k\\times b$ matrix to approximate the median of $n\\leq b^{k}$ streaming input values by recursively replacing buffers of $b$ values with their medians, thereby ignoring its $200(\\lceil b/2\\rceil / b)^{k}\\%$ most extreme inputs. Rousseeuw & Bassett (1990) and Chao & Lin (1993); Chen & Chen (2005) study the remedian's distribution as $k\\rightarrow\\infty$ and as $k,b\\rightarrow\\infty$. The remedian's breakdown point vanishes as $k\\rightarrow\\infty$, but approaches $(1/2)^{k}$ as $b\\rightarrow\\infty$. We study the remedian's robust-regime distribution as $b\\rightarrow\\infty$, deriving a normal distribution for standardized (mean, median, remedian, remedian rank) as $b\\rightarrow\\infty$, thereby illuminating the remedian's accuracy in approximating the sample median. We derive the asymptotic efficiency of the remedian relative to the mean and the median. Finally, we discuss the estimation of more than one quantile at once, proposing an asymptotic distribution for the random vector that results when we apply remedian estimation in parallel to the components of i.i.d. random vectors.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locally sharp goodness-of-fit testing in sup norm for high-dimensional counts","authors":"Subhodh Kotekal, Julien Chhor, Chao Gao","doi":"arxiv-2409.08871","DOIUrl":"https://doi.org/arxiv-2409.08871","url":null,"abstract":"We consider testing the goodness-of-fit of a distribution against alternatives separated in sup norm. We study the twin settings of Poisson-generated count data with a large number of categories and high-dimensional multinomials. In previous studies of different separation metrics, it has been found that the local minimax separation rate exhibits substantial heterogeneity and is a complicated function of the null distribution; the rate-optimal test requires careful tailoring to the null. In the setting of sup norm, this remains the case and we establish that the local minimax separation rate is determined by the finer decay behavior of the category rates. The upper bound is obtained by a test involving the sample maximum, and the lower bound argument involves reducing the original heteroskedastic null to an auxiliary homoskedastic null determined by the decay of the rates. Further, in a particular asymptotic setup, the sharp constants are identified.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"209 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-dimensional regression with a count response","authors":"Or Zilberman, Felix Abramovich","doi":"arxiv-2409.08821","DOIUrl":"https://doi.org/arxiv-2409.08821","url":null,"abstract":"We consider high-dimensional regression with a count response modeled by Poisson or negative binomial generalized linear model (GLM). We propose a penalized maximum likelihood estimator with a properly chosen complexity penalty and establish its adaptive minimaxity across models of various sparsity. To make the procedure computationally feasible for high-dimensional data we consider its LASSO and SLOPE convex surrogates. Their performance is illustrated through simulated and real-data examples.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142268001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Organized State-Space Models with Artificial Dynamics","authors":"Yuan Chen, Mathieu Gerber, Christophe Andrieu, Randal Douc","doi":"arxiv-2409.08928","DOIUrl":"https://doi.org/arxiv-2409.08928","url":null,"abstract":"In this paper we consider a state-space model (SSM) parametrized by some parameter $\\theta$, and our aim is to perform joint parameter and state inference. A simple idea to perform this task, which almost dates back to the origin of the Kalman filter, is to replace the static parameter $\\theta$ by a Markov chain $(\\theta_t)_{t\\geq 0}$ on the parameter space and then to apply a standard filtering algorithm to the extended, or self-organized SSM. However, the practical implementation of this idea in a theoretically justified way has remained an open problem. In this paper we fill this gap by introducing various possible constructions of the Markov chain $(\\theta_t)_{t\\geq 0}$ that ensure the validity of the self-organized SSM (SO-SSM) for joint parameter and state inference. Notably, we show that theoretically valid SO-SSMs can be defined even if $|\\mathrm{Var}(\\theta_{t}|\\theta_{t-1})|$ converges to 0 slowly as $t\\rightarrow\\infty$. This result is important since, as illustrated in our numerical experiments, such models can be efficiently approximated using standard particle filter algorithms. While the idea studied in this work was first introduced for online inference in SSMs, it has also been proved to be useful for computing the maximum likelihood estimator (MLE) of a given SSM, since iterated filtering algorithms can be seen as particle filters applied to SO-SSMs for which the target parameter value is the MLE of interest. Based on this observation, we also derive constructions of $(\\theta_t)_{t\\geq 0}$ and theoretical results tailored to these specific applications of SO-SSMs, and as a result, we introduce new iterated filtering algorithms. From a practical point of view, the algorithms introduced in this work have the merit of being simple to implement and only requiring minimal tuning to perform well.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}