{"title":"Randomized Wasserstein Barycenter Computation: Resampling with Statistical Guarantees","authors":"F. Heinemann, A. Munk, Y. Zemel","doi":"10.1137/20m1385263","DOIUrl":"https://doi.org/10.1137/20m1385263","url":null,"abstract":"We propose a hybrid resampling method to approximate finitely supported Wasserstein barycenters on large-scale datasets, which can be combined with any exact solver. Nonasymptotic bounds on the expected error of the objective value as well as the barycenters themselves allow to calibrate computational cost and statistical accuracy. The rate of these upper bounds is shown to be optimal and independent of the underlying dimension, which appears only in the constants. Using a simple modification of the subgradient descent algorithm of Cuturi and Doucet, we showcase the applicability of our method on a myriad of simulated datasets, as well as a real-data example which are out of reach for state of the art algorithms for computing Wasserstein barycenters.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"69 1","pages":"229-259"},"PeriodicalIF":0.0,"publicationDate":"2020-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83296297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic Tverberg Theorems With Applications in Multiclass Logistic Regression, Separability, and Centerpoints of Data","authors":"J. D. Loera, T. A. Hogan","doi":"10.1137/19m1277102","DOIUrl":"https://doi.org/10.1137/19m1277102","url":null,"abstract":"We present new stochastic geometry theorems that give bounds on the probability that $m$ random data classes all contain a point in common in their convex hulls. These theorems relate to the existe...","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"116 1","pages":"1151-1166"},"PeriodicalIF":0.0,"publicationDate":"2020-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79373299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting, and Regularization","authors":"Ke Wang, Christos Thrampoulidis","doi":"10.1137/21m1415121","DOIUrl":"https://doi.org/10.1137/21m1415121","url":null,"abstract":"Deep neural networks generalize well despite being exceedingly overparameterized and being trained without explicit regularization. This curious phenomenon has inspired extensive research activity in establishing its statistical principles: Under what conditions is it observed? How do these depend on the data and on the training algorithm? When does regularization benefit generalization? While such questions remain wide open for deep neural nets, recent works have attempted gaining insights by studying simpler, often linear, models. Our paper contributes to this growing line of work by examining binary linear classification under a generative Gaussian mixture model. Motivated by recent results on the implicit bias of gradient descent, we study both max-margin SVM classifiers (corresponding to logistic loss) and min-norm interpolating classifiers (corresponding to least-squares loss). First, we leverage an idea introduced in [V. Muthukumar et al., arXiv:2005.08054, (2020)] to relate the SVM solution to the min-norm interpolating solution. Second, we derive novel non-asymptotic bounds on the classification error of the latter. Combining the two, we present novel sufficient conditions on the covariance spectrum and on the signal-to-noise ratio (SNR) under which interpolating estimators achieve asymptotically optimal performance as overparameterization increases. Interestingly, our results extend to a noisy model with constant probability noise flips. Contrary to previously studied discriminative data models, our results emphasize the crucial role of the SNR and its interplay with the data covariance. Finally, via a combination of analytical arguments and numerical demonstrations we identify conditions under which the interpolating estimator performs better than corresponding regularized estimates.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"20 1","pages":"260-284"},"PeriodicalIF":0.0,"publicationDate":"2020-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75813460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory Capacity of Neural Networks with Threshold and Rectified Linear Unit Activations","authors":"R. Vershynin","doi":"10.1137/20m1314884","DOIUrl":"https://doi.org/10.1137/20m1314884","url":null,"abstract":"Overwhelming theoretical and empirical evidence shows that mildly overparametrized neural networks---those with more connections than the size of the training data---are often able to memorize the ...","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"19 1","pages":"1004-1033"},"PeriodicalIF":0.0,"publicationDate":"2020-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76907136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination","authors":"Cole Hawkins, Xing-er Liu, Zheng Zhang","doi":"10.1137/21m1391444","DOIUrl":"https://doi.org/10.1137/21m1391444","url":null,"abstract":"While post-training model compression can greatly reduce the inference cost of a deep neural network, uncompressed training still consumes a huge amount of hardware resources, run-time and energy. It is highly desirable to directly train a compact neural network from scratch with low memory and low computational cost. Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks. However, directly training a low-rank tensorized neural network is a very challenging task because it is hard to determine a proper tensor rank {it a priori}, which controls the model complexity and compression ratio in the training process. This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks. We first develop a flexible Bayesian model that can handle various low-rank tensor formats (e.g., CP, Tucker, tensor train and tensor-train matrix) that compress neural network parameters in training. This model can automatically determine the tensor ranks inside a nonlinear forward model, which is beyond the capability of existing Bayesian tensor methods. We further develop a scalable stochastic variational inference solver to estimate the posterior density of large-scale problems in training. Our work provides the first general-purpose rank-adaptive framework for end-to-end tensorized training. Our numerical results on various neural network architectures show orders-of-magnitude parameter reduction and little accuracy loss (or even better accuracy) in the training process. Specifically, on a very large deep learning recommendation system with over $4.2times 10^9$ model parameters, our method can reduce the variables to only $1.6times 10^5$ automatically in the training process (i.e., by $2.6times 10^4$ times) while achieving almost the same accuracy.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"45 1","pages":"46-71"},"PeriodicalIF":0.0,"publicationDate":"2020-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80791672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consistency of Archetypal Analysis","authors":"B. Osting, Dong Wang, Yiming Xu, Dominique Zosso","doi":"10.1137/20M1331792","DOIUrl":"https://doi.org/10.1137/20M1331792","url":null,"abstract":"Archetypal analysis is an unsupervised learning method that uses a convex polytope to summarize multivariate data. For fixed $k$, the method finds a convex polytope with $k$ vertices, called archetype points, such that the polytope is contained in the convex hull of the data and the mean squared distance between the data and the polytope is minimal. In this paper, we prove a consistency result that shows if the data is independently sampled from a probability measure with bounded support, then the archetype points converge to a solution of the continuum version of the problem, of which we identify and establish several properties. We also obtain the convergence rate of the optimal objective values under appropriate assumptions on the distribution. If the data is independently sampled from a distribution with unbounded support, we also prove a consistency result for a modified method that penalizes the dispersion of the archetype points. Our analysis is supported by detailed computational experiments of the archetype points for data sampled from the uniform distribution in a disk, the normal distribution, an annular distribution, and a Gaussian mixture model.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"50 1","pages":"1-30"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90037347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abhishek K. Gupta, Hao Chen, Jianzong Pi, Gaurav Tendolkar
{"title":"Some Limit Properties of Markov Chains Induced by Recursive Stochastic Algorithms","authors":"Abhishek K. Gupta, Hao Chen, Jianzong Pi, Gaurav Tendolkar","doi":"10.1137/19m1258104","DOIUrl":"https://doi.org/10.1137/19m1258104","url":null,"abstract":"Recursive stochastic algorithms have gained significant attention in the recent past due to data-driven applications. Examples include stochastic gradient descent for solving large-scale optimizati...","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"97 1","pages":"967-1003"},"PeriodicalIF":0.0,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79472710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
W. Weir, Benjamin Walker, Lenka Zdeborov'a, P. Mucha
{"title":"Multilayer Modularity Belief Propagation to Assess Detectability of Community Structure","authors":"W. Weir, Benjamin Walker, Lenka Zdeborov'a, P. Mucha","doi":"10.1137/19m1279812","DOIUrl":"https://doi.org/10.1137/19m1279812","url":null,"abstract":"Modularity-based community detection encompasses a number of widely used, efficient heuristics for identification of structure in networks. Recently, a belief propagation approach to modularity opt...","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"1 1","pages":"872-900"},"PeriodicalIF":0.0,"publicationDate":"2020-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90478008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential Construction and Dimension Reduction of Gaussian Processes Under Inequality Constraints","authors":"F. Bachoc, A. F. López-Lopera, O. Roustant","doi":"10.1137/21m1407513","DOIUrl":"https://doi.org/10.1137/21m1407513","url":null,"abstract":"Accounting for inequality constraints, such as boundedness, monotonicity or convexity, is challenging when modeling costly-to-evaluate black box functions. In this regard, finite-dimensional Gaussian process (GP) models bring a valuable solution, as they guarantee that the inequality constraints are satisfied everywhere. Nevertheless, these models are currently restricted to small dimensional situations (up to dimension 5). Addressing this issue, we introduce the MaxMod algorithm that sequentially inserts one-dimensional knots or adds active variables, thereby performing at the same time dimension reduction and efficient knot allocation. We prove the convergence of this algorithm. In intermediary steps of the proof, we propose the notion of multi-affine extension and study its properties. We also prove the convergence of finite-dimensional GPs, when the knots are not dense in the input space, extending the recent literature. With simulated and real data, we demonstrate that the MaxMod algorithm remains efficient in higher dimension (at least in dimension 20), and has a smaller computational complexity than other constrained GP models from the state-of-the-art, to reach a given approximation error.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45269075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emmanuel Chevallier, Didong Li, Yulong Lu, D. Dunson
{"title":"Exponential-Wrapped Distributions on Symmetric Spaces","authors":"Emmanuel Chevallier, Didong Li, Yulong Lu, D. Dunson","doi":"10.1137/21m1461551","DOIUrl":"https://doi.org/10.1137/21m1461551","url":null,"abstract":". In many applications, the curvature of the space supporting the data makes the statistical modelling challenging. In this paper we discuss the construction and use of probability distributions wrapped around manifolds using exponential maps. These distributions have already been used on specific manifolds. We describe their construction in the unifying framework of affine locally symmetric spaces. Affine locally symmetric spaces are a broad class of manifolds containing many manifolds encountered in data sciences. We show that on these spaces, exponential-wrapped distributions enjoy interesting properties for practical use. We provide the generic expression of the Jacobian appearing in these distributions and compute it on two particular examples: Grassmannians and pseudo-hyperboloids. We illustrate the interest of such distributions in a classification experiment on simulated data.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47949469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}