{"title":"过度参数化的可操作性:以负感知器为例","authors":"Andrea Montanari, Yiqiao Zhong, Kangjie Zhou","doi":"10.1007/s00440-023-01248-y","DOIUrl":null,"url":null,"abstract":"<p>In the negative perceptron problem we are given <i>n</i> data points <span>\\((\\varvec{x}_i,y_i)\\)</span>, where <span>\\(\\varvec{x}_i\\)</span> is a <i>d</i>-dimensional vector and <span>\\(y_i\\in \\{+1,-1\\}\\)</span> is a binary label. The data are not linearly separable and hence we content ourselves to find a linear classifier with the largest possible <i>negative</i> margin. In other words, we want to find a unit norm vector <span>\\(\\varvec{\\theta }\\)</span> that maximizes <span>\\(\\min _{i\\le n}y_i\\langle \\varvec{\\theta },\\varvec{x}_i\\rangle \\)</span>. This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which <span>\\(n,d\\rightarrow \\infty \\)</span> with <span>\\(n/d\\rightarrow \\delta \\)</span>, and prove upper and lower bounds on the maximum margin <span>\\(\\kappa _{{\\textrm{s}}}(\\delta )\\)</span> or—equivalently—on its inverse function <span>\\(\\delta _{{\\textrm{s}}}(\\kappa )\\)</span>. In other words, <span>\\(\\delta _{{\\textrm{s}}}(\\kappa )\\)</span> is the overparametrization threshold: for <span>\\(n/d\\le \\delta _{{\\textrm{s}}}(\\kappa )-{\\varepsilon }\\)</span> a classifier achieving vanishing training error exists with high probability, while for <span>\\(n/d\\ge \\delta _{{\\textrm{s}}}(\\kappa )+{\\varepsilon }\\)</span> it does not. Our bounds on <span>\\(\\delta _{{\\textrm{s}}}(\\kappa )\\)</span> match to the leading order as <span>\\(\\kappa \\rightarrow -\\infty \\)</span>. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold <span>\\(\\delta _{\\textrm{lin}}(\\kappa )\\)</span>. We observe a gap between the interpolation threshold <span>\\(\\delta _{{\\textrm{s}}}(\\kappa )\\)</span> and the linear programming threshold <span>\\(\\delta _{\\textrm{lin}}(\\kappa )\\)</span>, raising the question of the behavior of other algorithms.\n</p>","PeriodicalId":20527,"journal":{"name":"Probability Theory and Related Fields","volume":"113 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Tractability from overparametrization: the example of the negative perceptron\",\"authors\":\"Andrea Montanari, Yiqiao Zhong, Kangjie Zhou\",\"doi\":\"10.1007/s00440-023-01248-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>In the negative perceptron problem we are given <i>n</i> data points <span>\\\\((\\\\varvec{x}_i,y_i)\\\\)</span>, where <span>\\\\(\\\\varvec{x}_i\\\\)</span> is a <i>d</i>-dimensional vector and <span>\\\\(y_i\\\\in \\\\{+1,-1\\\\}\\\\)</span> is a binary label. The data are not linearly separable and hence we content ourselves to find a linear classifier with the largest possible <i>negative</i> margin. In other words, we want to find a unit norm vector <span>\\\\(\\\\varvec{\\\\theta }\\\\)</span> that maximizes <span>\\\\(\\\\min _{i\\\\le n}y_i\\\\langle \\\\varvec{\\\\theta },\\\\varvec{x}_i\\\\rangle \\\\)</span>. This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which <span>\\\\(n,d\\\\rightarrow \\\\infty \\\\)</span> with <span>\\\\(n/d\\\\rightarrow \\\\delta \\\\)</span>, and prove upper and lower bounds on the maximum margin <span>\\\\(\\\\kappa _{{\\\\textrm{s}}}(\\\\delta )\\\\)</span> or—equivalently—on its inverse function <span>\\\\(\\\\delta _{{\\\\textrm{s}}}(\\\\kappa )\\\\)</span>. In other words, <span>\\\\(\\\\delta _{{\\\\textrm{s}}}(\\\\kappa )\\\\)</span> is the overparametrization threshold: for <span>\\\\(n/d\\\\le \\\\delta _{{\\\\textrm{s}}}(\\\\kappa )-{\\\\varepsilon }\\\\)</span> a classifier achieving vanishing training error exists with high probability, while for <span>\\\\(n/d\\\\ge \\\\delta _{{\\\\textrm{s}}}(\\\\kappa )+{\\\\varepsilon }\\\\)</span> it does not. Our bounds on <span>\\\\(\\\\delta _{{\\\\textrm{s}}}(\\\\kappa )\\\\)</span> match to the leading order as <span>\\\\(\\\\kappa \\\\rightarrow -\\\\infty \\\\)</span>. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold <span>\\\\(\\\\delta _{\\\\textrm{lin}}(\\\\kappa )\\\\)</span>. We observe a gap between the interpolation threshold <span>\\\\(\\\\delta _{{\\\\textrm{s}}}(\\\\kappa )\\\\)</span> and the linear programming threshold <span>\\\\(\\\\delta _{\\\\textrm{lin}}(\\\\kappa )\\\\)</span>, raising the question of the behavior of other algorithms.\\n</p>\",\"PeriodicalId\":20527,\"journal\":{\"name\":\"Probability Theory and Related Fields\",\"volume\":\"113 1\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2024-01-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Probability Theory and Related Fields\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1007/s00440-023-01248-y\",\"RegionNum\":1,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"STATISTICS & PROBABILITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Probability Theory and Related Fields","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1007/s00440-023-01248-y","RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
Tractability from overparametrization: the example of the negative perceptron
In the negative perceptron problem we are given n data points \((\varvec{x}_i,y_i)\), where \(\varvec{x}_i\) is a d-dimensional vector and \(y_i\in \{+1,-1\}\) is a binary label. The data are not linearly separable and hence we content ourselves to find a linear classifier with the largest possible negative margin. In other words, we want to find a unit norm vector \(\varvec{\theta }\) that maximizes \(\min _{i\le n}y_i\langle \varvec{\theta },\varvec{x}_i\rangle \). This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which \(n,d\rightarrow \infty \) with \(n/d\rightarrow \delta \), and prove upper and lower bounds on the maximum margin \(\kappa _{{\textrm{s}}}(\delta )\) or—equivalently—on its inverse function \(\delta _{{\textrm{s}}}(\kappa )\). In other words, \(\delta _{{\textrm{s}}}(\kappa )\) is the overparametrization threshold: for \(n/d\le \delta _{{\textrm{s}}}(\kappa )-{\varepsilon }\) a classifier achieving vanishing training error exists with high probability, while for \(n/d\ge \delta _{{\textrm{s}}}(\kappa )+{\varepsilon }\) it does not. Our bounds on \(\delta _{{\textrm{s}}}(\kappa )\) match to the leading order as \(\kappa \rightarrow -\infty \). We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold \(\delta _{\textrm{lin}}(\kappa )\). We observe a gap between the interpolation threshold \(\delta _{{\textrm{s}}}(\kappa )\) and the linear programming threshold \(\delta _{\textrm{lin}}(\kappa )\), raising the question of the behavior of other algorithms.
期刊介绍:
Probability Theory and Related Fields publishes research papers in modern probability theory and its various fields of application. Thus, subjects of interest include: mathematical statistical physics, mathematical statistics, mathematical biology, theoretical computer science, and applications of probability theory to other areas of mathematics such as combinatorics, analysis, ergodic theory and geometry. Survey papers on emerging areas of importance may be considered for publication. The main languages of publication are English, French and German.