Bernoulli, Pub Date: 2021-04-10, DOI: 10.3150/22-bej1579
Xiaozhuo Zhang, Zhiqiang Hou, Z. Bai, Jiang Hu
{"title":"Spiked eigenvalues of noncentral Fisher matrix with applications","authors":"Xiaozhuo Zhang, Zhiqiang Hou, Z. Bai, Jiang Hu","doi":"10.3150/22-bej1579","DOIUrl":"https://doi.org/10.3150/22-bej1579","url":null,"abstract":"In this paper, we investigate the asymptotic behavior of the spiked eigenvalues of the noncentral Fisher matrix defined by $\\mathbf{F}_p=\\mathbf{C}_n(\\mathbf{S}_N)^{-1}$, where $\\mathbf{C}_n$ is a noncentral sample covariance matrix defined by $(\\mathbf{\\Xi}+\\mathbf{X})(\\mathbf{\\Xi}+\\mathbf{X})^*/n$ and $\\mathbf{S}_N=\\mathbf{Y}\\mathbf{Y}^*/N$. The matrices $\\mathbf{X}$ and $\\mathbf{Y}$ are two independent Gaussian arrays of respective dimensions $p\\times n$ and $p\\times N$, whose entries are independent and identically distributed (i.i.d.) with mean $0$ and variance $1$. When $p$, $n$, and $N$ grow to infinity proportionally, we establish a phase transition of the spiked eigenvalues of $\\mathbf{F}_p$. Furthermore, we derive the central limit theorem (CLT) for the spiked eigenvalues of $\\mathbf{F}_p$. As an accessory to the proof of the above results, the fluctuations of the spiked eigenvalues of $\\mathbf{C}_n$ are studied, which is of independent interest. 
In addition, we use the results for the spiked noncentral Fisher matrix to derive the limits and CLT for the sample canonical correlation coefficients, and we give three consistent estimators, including estimators of the population spiked eigenvalues and the population canonical correlation coefficients.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42668925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-04-07, DOI: 10.3150/23-bej1589
S. Favaro, Zacharie Naulet
{"title":"Near-optimal estimation of the unseen under regularly varying tail populations","authors":"S. Favaro, Zacharie Naulet","doi":"10.3150/23-bej1589","DOIUrl":"https://doi.org/10.3150/23-bej1589","url":null,"abstract":"Given $n$ samples from a population of individuals belonging to different species, what is the number $U$ of hitherto unseen species that would be observed if $\\lambda n$ new samples were collected? This is an important problem in many scientific endeavors, and it has been the subject of recent works introducing non-parametric estimators of $U$ that are minimax near-optimal and consistent all the way up to $\\lambda \\asymp \\log n$. These works do not rely on any assumption on the underlying unknown distribution $p$ of the population, and therefore, while they provide a theory in its greatest generality, worst-case distributions may severely hamper the estimation of $U$ in concrete applications. In this paper, we consider the problem of strengthening the non-parametric framework for estimating $U$. Inspired by the estimation of rare probabilities in extreme value theory, and motivated by the ubiquitous power-law type distributions in many natural and social phenomena, we make use of a semi-parametric assumption of regular variation of index $\\alpha \\in (0,1)$ for the tail behaviour of $p$. Under this assumption, we introduce an estimator of $U$ that is simple, linear in the sampling information, computationally efficient, and scalable to massive datasets. Then, uniformly over our class of regularly varying tail distributions, we show that the proposed estimator has provable guarantees: i) it is minimax near-optimal, up to a power of $\\log n$ factor; ii) it is consistent all of the way up to $\\log\\lambda \\asymp n^{\\alpha/2}/\\sqrt{\\log n}$, and this range is the best possible. This work presents the first study on the estimation of the unseen under regularly varying tail distributions. 
A numerical illustration of our methodology is presented for synthetic data and real data.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44428263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-04-07, DOI: 10.3150/22-bej1562
Stéphan Clémençon, Hamid Jalalzai, Stéphane Lhaut, Anne Sabourin, J. Segers
{"title":"Concentration bounds for the empirical angular measure with statistical learning applications","authors":"Stéphan Clémençon, Hamid Jalalzai, Stéphane Lhaut, Anne Sabourin, J. Segers","doi":"10.3150/22-bej1562","DOIUrl":"https://doi.org/10.3150/22-bej1562","url":null,"abstract":"The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, the rank transformation offers a convenient and robust way of standardizing data in order to build an empirical version of the angular measure based on the most extreme observations. However, the study of the sampling distribution of the resulting empirical angular measure is challenging. It is the purpose of the paper to establish finite-sample bounds for the maximal deviations between the empirical and true angular measures, uniformly over classes of Borel sets of controlled combinatorial complexity. The bounds are valid with high probability and, up to logarithmic factors, scale as the square root of the effective sample size. 
The bounds are applied to provide performance guarantees for two statistical learning procedures tailored to extreme regions of the input space and built upon the empirical angular measure: binary classification in extreme regions through empirical risk minimization and unsupervised anomaly detection through minimum-volume sets of the sphere.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47877995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-04-07, DOI: 10.3150/22-bej1570
Mai Bui, Y. Pokern, P. Dellaportas
{"title":"Inference for partially observed Riemannian Ornstein–Uhlenbeck diffusions of covariance matrices","authors":"Mai Bui, Y. Pokern, P. Dellaportas","doi":"10.3150/22-bej1570","DOIUrl":"https://doi.org/10.3150/22-bej1570","url":null,"abstract":"We construct a generalization of the Ornstein-Uhlenbeck processes on the cone of covariance matrices endowed with the Log-Euclidean and the Affine-Invariant metrics. Our development exploits the Riemannian geometric structure of symmetric positive definite matrices viewed as a differential manifold. We then provide Bayesian inference for discretely observed diffusion processes of covariance matrices based on an MCMC algorithm built with the help of a novel diffusion bridge sampler accounting for the geometric structure. Our proposed algorithm is illustrated with a real data financial application.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49250092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-04-02, DOI: 10.3150/22-bej1474
Zinsou Max Debaly, L. Truquet
{"title":"Multivariate time series models for mixed data","authors":"Zinsou Max Debaly, L. Truquet","doi":"10.3150/22-bej1474","DOIUrl":"https://doi.org/10.3150/22-bej1474","url":null,"abstract":"We introduce a general approach for modeling the dynamic of multivariate time series when the data are of mixed type (binary/count/continuous). Our method is quite flexible and conditionally on past values, each coordinate at time $t$ can have a distribution compatible with a standard univariate time series model such as GARCH, ARMA, INGARCH or logistic models whereas past values of the other coordinates play the role of exogenous covariates in the dynamic. The simultaneous dependence in the multivariate time series can be modeled with a copula. Additional exogenous covariates are also allowed in the dynamic. We first study usual stability properties of these models and then show that autoregressive parameters can be consistently estimated equation-by-equation using a pseudo-maximum likelihood method, leading to a fast implementation even when the number of time series is large. Moreover, we prove consistency results when a parametric copula model is fitted to the time series and in the case of Gaussian copulas, we show that the likelihood estimator of the correlation matrix is strongly consistent. We carefully check all our assumptions for two prototypical examples: a GARCH/INGARCH model and logistic/log-linear INGARCH model. 
Our results are illustrated with numerical experiments as well as two real data sets.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":"232 3","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41263042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-03-20, DOI: 10.3150/22-bej1552
A. Kock, David Preinerstorfer
{"title":"Consistency of p-norm based tests in high dimensions: Characterization, monotonicity, domination","authors":"A. Kock, David Preinerstorfer","doi":"10.3150/22-bej1552","DOIUrl":"https://doi.org/10.3150/22-bej1552","url":null,"abstract":"Many commonly used test statistics are based on a norm measuring the evidence against the null hypothesis. To understand how the choice of a norm affects power properties of tests in high dimensions, we study the consistency sets of $p$-norm based tests in the prototypical framework of sequence models with unrestricted parameter spaces, the null hypothesis being that all observations have zero mean. The consistency set of a test is here defined as the set of all arrays of alternatives the test is consistent against as the dimension of the parameter space diverges. We characterize the consistency sets of $p$-norm based tests and find, in particular, that the consistency against an array of alternatives cannot be determined solely in terms of the $p$-norm of the alternative. Our characterization also reveals an unexpected monotonicity result: namely that the consistency set is strictly increasing in $p \\in (0, \\infty)$, such that tests based on higher $p$ strictly dominate those based on lower $p$ in terms of consistency. 
This monotonicity allows us to construct novel tests that dominate, with respect to their consistency behavior, all $p$-norm based tests without sacrificing size.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44743516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-03-04, DOI: 10.3150/22-bej1483
T. Karvonen
{"title":"Small sample spaces for Gaussian processes","authors":"T. Karvonen","doi":"10.3150/22-bej1483","DOIUrl":"https://doi.org/10.3150/22-bej1483","url":null,"abstract":"It is known that the membership in a given reproducing kernel Hilbert space (RKHS) of the samples of a Gaussian process $X$ is controlled by a certain nuclear dominance condition. However, it is less clear how to identify a \"small\" set of functions (not necessarily a vector space) that contains the samples. This article presents a general approach for identifying such sets. We use scaled RKHSs, which can be viewed as a generalisation of Hilbert scales, to define the sample support set as the largest set which is contained in every element of full measure under the law of $X$ in the $\\sigma$-algebra induced by the collection of scaled RKHSs. This potentially non-measurable set is then shown to consist of those functions that can be expanded in terms of an orthonormal basis of the RKHS of the covariance kernel of $X$ and have their squared basis coefficients bounded away from zero and infinity, a result suggested by the Karhunen-Loève theorem.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42689494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-03-01, DOI: 10.3150/20-BEJ1260
Puying Zhao, Lei Wang, Junchao Shao
{"title":"Sufficient dimension reduction and instrument search for data with nonignorable nonresponse","authors":"Puying Zhao, Lei Wang, Junchao Shao","doi":"10.3150/20-BEJ1260","DOIUrl":"https://doi.org/10.3150/20-BEJ1260","url":null,"abstract":"","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46437883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-03-01, DOI: 10.3150/21-bej1417
R. Maller, S. Resnick, S. Shemehsavar
{"title":"Splitting the sample at the largest uncensored observation","authors":"R. Maller, S. Resnick, S. Shemehsavar","doi":"10.3150/21-bej1417","DOIUrl":"https://doi.org/10.3150/21-bej1417","url":null,"abstract":"We calculate finite sample and asymptotic distributions for the largest censored and uncensored survival times, and some related statistics, from a sample of survival data generated according to an iid censoring model. These statistics are important for assessing whether there is sufficient follow-up in the sample to be confident of the presence of immune or cured individuals in the population. A key structural result obtained is that, conditional on the value of the largest uncensored survival time, and knowing the number of censored observations exceeding this time, the sample partitions into two independent subsamples, each subsample having the distribution of an iid sample of censored survival times, of reduced size, from truncated random variables. This result provides valuable insight into the construction of censored survival data, and facilitates the calculation of explicit finite sample formulae. We illustrate for distributions of statistics useful for testing for sufficient follow-up in a sample, and apply extreme value methods to derive asymptotic distributions for some of those. 
MSC 2010 subject classifications: MSC2000 Subject Classifications: Primary 62N01, 62N02, 62N03, 62E10, 62E15, 62E20, 62G05; secondary 62F03, 62F05, 62F12, 62G32.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46551907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernoulli, Pub Date: 2021-03-01, DOI: 10.3150/20-BEJ1318
Ghurumuruhan Ganesan
{"title":"Minimum spanning trees of random geometric graphs with location dependent weights","authors":"Ghurumuruhan Ganesan","doi":"10.3150/20-BEJ1318","DOIUrl":"https://doi.org/10.3150/20-BEJ1318","url":null,"abstract":"Consider $n$ nodes $\\{X_i\\}_{1 \\le i \\le n}$ independently distributed in the unit square $S$, each according to a distribution $f$. Nodes $X_i$ and $X_j$ are joined by an edge if the Euclidean distance $d(X_i,X_j)$ is less than $r_n$, the adjacency distance, and the resulting random graph $G_n$ is called a random geometric graph (RGG). We now assign a location dependent weight to each edge of $G_n$ and define $MST_n$ to be the sum of the weights of the minimum spanning trees of all components of $G_n$. For values of $r_n$ above the connectivity regime, we obtain upper and lower bound deviation estimates for $MST_n$ and $L^2$-convergence of $MST_n$ appropriately scaled and centred.","PeriodicalId":55387,"journal":{"name":"Bernoulli","volume":"27 1","pages":"2473-2493"},"PeriodicalIF":1.5,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48226045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}