Title: Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares
Authors: Anant Raj, Melih Barsbey, M. Gürbüzbalaban, Lingjiong Zhu, Umut Simsekli
Venue: International Conference on Algorithmic Learning Theory. Published 2022-06-02.
DOI: 10.48550/arXiv.2206.01274 (https://doi.org/10.48550/arXiv.2206.01274)
Abstract: Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared loss $x \mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x \mapsto |x|^p$ for some $p < 2$. (ii) Depending on the variance of the data, there exists a "threshold of heavy-tailedness" such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.

{"title":"Tournaments, Johnson Graphs, and NC-Teaching","authors":"H. Simon","doi":"10.48550/arXiv.2205.02792","DOIUrl":"https://doi.org/10.48550/arXiv.2205.02792","url":null,"abstract":"Quite recently a teaching model, called\"No-Clash Teaching\"or simply\"NC-Teaching\", had been suggested that is provably optimal in the following strong sense. First, it satisfies Goldman and Matthias' collusion-freeness condition. Second, the NC-teaching dimension (= NCTD) is smaller than or equal to the teaching dimension with respect to any other collusion-free teaching model. It has also been shown that any concept class which has NC-teaching dimension $d$ and is defined over a domain of size $n$ can have at most $2^d binom{n}{d}$ concepts. The main results in this paper are as follows. First, we characterize the maximum concept classes of NC-teaching dimension $1$ as classes which are induced by tournaments (= complete oriented graphs) in a very natural way. Second, we show that there exists a family $(cC_n)_{nge1}$ of concept classes such that the well known recursive teaching dimension (= RTD) of $cC_n$ grows logarithmically in $n = |cC_n|$ while, for every $nge1$, the NC-teaching dimension of $cC_n$ equals $1$. Since the recursive teaching dimension of a finite concept class $cC$ is generally bounded $log|cC|$, the family $(cC_n)_{nge1}$ separates RTD from NCTD in the most striking way. The proof of existence of the family $(cC_n)_{nge1}$ makes use of the probabilistic method and random tournaments. Third, we improve the afore-mentioned upper bound $2^dbinom{n}{d}$ by a factor of order $sqrt{d}$. The verification of the superior bound makes use of Johnson graphs and maximum subgraphs not containing large narrow cliques.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129930101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implicit Parameter-free Online Learning with Truncated Linear Models","authors":"Keyi Chen, Ashok Cutkosky, Francesco Orabona","doi":"10.48550/arXiv.2203.10327","DOIUrl":"https://doi.org/10.48550/arXiv.2203.10327","url":null,"abstract":"Parameter-free algorithms are online learning algorithms that do not require setting learning rates. They achieve optimal regret with respect to the distance between the initial point and any competitor. Yet, parameter-free algorithms do not take into account the geometry of the losses. Recently, in the stochastic optimization literature, it has been proposed to instead use truncated linear lower bounds, which produce better performance by more closely modeling the losses. In particular, truncated linear models greatly reduce the problem of overshooting the minimum of the loss function. Unfortunately, truncated linear models cannot be used with parameter-free algorithms because the updates become very expensive to compute. In this paper, we propose new parameter-free algorithms that can take advantage of truncated linear models through a new update that has an\"implicit\"flavor. Based on a novel decomposition of the regret, the new update is efficient, requires only one gradient at each step, never overshoots the minimum of the truncated model, and retains the favorable parameter-free properties. We also conduct an empirical study demonstrating the practical utility of our algorithms.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131508395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Efficient and Optimal Fixed-Time Regret with Two Experts
Authors: L. Greenstreet, Nicholas J. A. Harvey, Victor S. Portella
Venue: International Conference on Algorithmic Learning Theory. Published 2022-03-15.
DOI: 10.48550/arXiv.2203.07577 (https://doi.org/10.48550/arXiv.2203.07577)
Abstract: Prediction with expert advice is a foundational problem in online learning. In instances with $T$ rounds and $n$ experts, the classical Multiplicative Weights Update method suffers at most $\sqrt{(T/2)\ln n}$ regret when $T$ is known beforehand. Moreover, this is asymptotically optimal when both $T$ and $n$ grow to infinity. However, when the number of experts $n$ is small/fixed, algorithms with better regret guarantees exist. In 1967, Cover gave a dynamic programming algorithm for the two-experts problem restricted to $\{0,1\}$ costs that suffers at most $\sqrt{T/(2\pi)} + O(1)$ regret with $O(T^2)$ pre-processing time. In this work, we propose an optimal algorithm for prediction with two experts' advice that works even for costs in $[0,1]$ and with $O(1)$ processing time per turn. Our algorithm builds on recent work on the experts problem based on techniques and tools from stochastic calculus.

{"title":"Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability","authors":"Lunjia Hu, Charlotte Peale, Omer Reingold","doi":"10.48550/arXiv.2203.04536","DOIUrl":"https://doi.org/10.48550/arXiv.2203.04536","url":null,"abstract":"We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class $D$ of distinguishers examining the outcomes generated according to the predictors' predictions. In the distribution-specific and realizable setting where the learner is given the data distribution together with a predictor class $P$ containing the target predictor, we show that the sample complexity of outcome indistinguishability is characterized by the metric entropy of $P$ w.r.t. the dual Minkowski norm defined by $D$, and equivalently by the metric entropy of $D$ w.r.t. the dual Minkowski norm defined by $P$. This equivalence makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry. Our sample complexity characterization implies a variant of metric entropy duality, which we show is nearly tight. In the distribution-free setting, we focus on the case considered by Dwork et al. where $P$ contains all possible predictors, hence the sample complexity only depends on $D$. In this setting, we show that the sample complexity of outcome indistinguishability is characterized by the fat-shattering dimension of $D$. We also show a strong sample complexity separation between realizable and agnostic outcome indistinguishability in both the distribution-free and the distribution-specific settings. This is in contrast to distribution-free (resp. distribution-specific) PAC learning where the sample complexity in both the realizable and the agnostic settings can be characterized by the VC dimension (resp. metric entropy).","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132145881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Leveraging Initial Hints for Free in Stochastic Linear Bandits
Authors: Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Qiuyi Zhang
Venue: International Conference on Algorithmic Learning Theory. Published 2022-03-08.
DOI: 10.48550/arXiv.2203.04274 (https://doi.org/10.48550/arXiv.2203.04274)
Abstract: We study the setting of optimizing with bandit feedback with additional prior knowledge provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde{O}(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde{O}(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint yields provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde{O}(m^{2/3}\sqrt{T})$ regret.

{"title":"Adversarially Robust Learning with Tolerance","authors":"H. Ashtiani, Vinayak Pathak, Ruth Urner","doi":"10.48550/arXiv.2203.00849","DOIUrl":"https://doi.org/10.48550/arXiv.2203.00849","url":null,"abstract":"We initiate the study of tolerant adversarial PAC-learning with respect to metric perturbation sets. In adversarial PAC-learning, an adversary is allowed to replace a test point $x$ with an arbitrary point in a closed ball of radius $r$ centered at $x$. In the tolerant version, the error of the learner is compared with the best achievable error with respect to a slightly larger perturbation radius $(1+gamma)r$. This simple tweak helps us bridge the gap between theory and practice and obtain the first PAC-type guarantees for algorithmic techniques that are popular in practice. Our first result concerns the widely-used ``perturb-and-smooth'' approach for adversarial learning. For perturbation sets with doubling dimension $d$, we show that a variant of these approaches PAC-learns any hypothesis class $mathcal{H}$ with VC-dimension $v$ in the $gamma$-tolerant adversarial setting with $Oleft(frac{v(1+1/gamma)^{O(d)}}{varepsilon}right)$ samples. This is in contrast to the traditional (non-tolerant) setting in which, as we show, the perturb-and-smooth approach can provably fail. Our second result shows that one can PAC-learn the same class using $widetilde{O}left(frac{d.vlog(1+1/gamma)}{varepsilon^2}right)$ samples even in the agnostic setting. This result is based on a novel compression-based algorithm, and achieves a linear dependence on the doubling dimension as well as the VC-dimension. This is in contrast to the non-tolerant setting where there is no known sample complexity upper bound that depend polynomially on the VC-dimension.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129477997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"La educación ambiental en los medios televisivos. Estudio de caso: Oromar TV","authors":"Erik Alexander Cumba Castro","doi":"10.17163/alt.v15n1.2020.10","DOIUrl":"https://doi.org/10.17163/alt.v15n1.2020.10","url":null,"abstract":"The current research article has the purpose of analyzing environmental education in television media in the province of Manabi. For which, it was decided to take the Oromar TV channel as a case study. This with the objective of measuring the social impact caused by the mass media in regard to the awareness and care of the environment in this province. In addition to examining the production of training content focused on environmental education within the programming of this channel. The methodology that was applied for the investigation is of qualitative type, so that the technique of documentary analysis was used for the revision of the programming of the Oromar TV channel, this was carried out in a sample period of two months. The results obtained show that there are shortcomings in the programming of the Oromar TV channel, due to the scarce productions of educational content. Therefore, it is concluded that, in the absence of an increase in training television programs, and total absence of specialized productions in the area of environmental education in the Oromar TV channel, that could cause a lack of knowledge in the television audience in regard to in matters of prevention and care of the environment in the province of Manabi.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133087933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intrinsic Complexity of Partial Learning","authors":"Sanjay Jain, E. Kinber","doi":"10.1007/978-3-319-46379-7_12","DOIUrl":"https://doi.org/10.1007/978-3-319-46379-7_12","url":null,"abstract":"","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129640349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Pattern Languages over Groups","authors":"R. Hölzl, Sanjay Jain, F. Stephan","doi":"10.1007/978-3-319-46379-7_13","DOIUrl":"https://doi.org/10.1007/978-3-319-46379-7_13","url":null,"abstract":"","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128962034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}