Gradient-Based Empirical Risk Minimization Using Local Polynomial Regression

Q1 Mathematics
Ali Jadbabaie, Anuran Makur, Devavrat Shah
{"title":"利用局部多项式回归实现基于梯度的经验风险最小化","authors":"Ali Jadbabaie, Anuran Makur, Devavrat Shah","doi":"10.1287/stsy.2022.0003","DOIUrl":null,"url":null,"abstract":"In this paper, we consider the widely studied problem of empirical risk minimization (ERM) of strongly convex and smooth loss functions using iterative gradient-based methods. A major goal of the existing literature has been to compare different prototypical algorithms, such as batch gradient descent (GD) or stochastic gradient descent (SGD), by analyzing their rates of convergence to ϵ-approximate solutions with respect to the number of gradient computations, which is also known as the oracle complexity. For example, the oracle complexity of GD is [Formula: see text], where n is the number of training samples and p is the parameter space dimension. When n is large, this can be prohibitively expensive in practice, and SGD is preferred due to its oracle complexity of [Formula: see text]. Such standard analyses only utilize the smoothness of the loss function in the parameter being optimized. In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of GD, SGD, and their variants in important regimes. Specifically, at every iteration, our proposed algorithm, Local Polynomial Interpolation-based Gradient Descent (LPI-GD), first performs local polynomial regression with a virtual batch of data points to learn the gradient of the loss function and then estimates the true gradient of the ERM objective function. We establish that the oracle complexity of LPI-GD is [Formula: see text], where d is the data space dimension, and the gradient of the loss function is assumed to belong to an η-Hölder class with respect to the data. Our proof extends the analysis of local polynomial regression in nonparametric statistics to provide supremum norm guarantees for interpolation in multivariate settings and also exploits tools from the inexact GD literature. Unlike the complexities of GD and SGD, the complexity of our method depends on d. However, our algorithm outperforms GD, SGD, and their variants in oracle complexity for a broad range of settings where d is small relative to n. 
For example, with typical loss functions (such as squared or cross-entropy loss), when [Formula: see text] for any [Formula: see text] and [Formula: see text] is at the statistical limit, our method can be made to require [Formula: see text] oracle calls for any [Formula: see text], while SGD and GD require [Formula: see text] and [Formula: see text] oracle calls, respectively.Funding: This work was supported in part by the Office of Naval Research [Grant N000142012394], in part by the Army Research Office [Multidisciplinary University Research Initiative Grant W911NF-19-1-0217], and in part by the National Science Foundation [Transdisciplinary Research In Principles Of Data Science, Foundations of Data Science].","PeriodicalId":36337,"journal":{"name":"Stochastic Systems","volume":"14 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gradient-Based Empirical Risk Minimization Using Local Polynomial Regression\",\"authors\":\"Ali Jadbabaie, Anuran Makur, Devavrat Shah\",\"doi\":\"10.1287/stsy.2022.0003\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we consider the widely studied problem of empirical risk minimization (ERM) of strongly convex and smooth loss functions using iterative gradient-based methods. A major goal of the existing literature has been to compare different prototypical algorithms, such as batch gradient descent (GD) or stochastic gradient descent (SGD), by analyzing their rates of convergence to ϵ-approximate solutions with respect to the number of gradient computations, which is also known as the oracle complexity. For example, the oracle complexity of GD is [Formula: see text], where n is the number of training samples and p is the parameter space dimension. When n is large, this can be prohibitively expensive in practice, and SGD is preferred due to its oracle complexity of [Formula: see text]. Such standard analyses only utilize the smoothness of the loss function in the parameter being optimized. In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of GD, SGD, and their variants in important regimes. Specifically, at every iteration, our proposed algorithm, Local Polynomial Interpolation-based Gradient Descent (LPI-GD), first performs local polynomial regression with a virtual batch of data points to learn the gradient of the loss function and then estimates the true gradient of the ERM objective function. We establish that the oracle complexity of LPI-GD is [Formula: see text], where d is the data space dimension, and the gradient of the loss function is assumed to belong to an η-Hölder class with respect to the data. Our proof extends the analysis of local polynomial regression in nonparametric statistics to provide supremum norm guarantees for interpolation in multivariate settings and also exploits tools from the inexact GD literature. Unlike the complexities of GD and SGD, the complexity of our method depends on d. However, our algorithm outperforms GD, SGD, and their variants in oracle complexity for a broad range of settings where d is small relative to n. 
For example, with typical loss functions (such as squared or cross-entropy loss), when [Formula: see text] for any [Formula: see text] and [Formula: see text] is at the statistical limit, our method can be made to require [Formula: see text] oracle calls for any [Formula: see text], while SGD and GD require [Formula: see text] and [Formula: see text] oracle calls, respectively.Funding: This work was supported in part by the Office of Naval Research [Grant N000142012394], in part by the Army Research Office [Multidisciplinary University Research Initiative Grant W911NF-19-1-0217], and in part by the National Science Foundation [Transdisciplinary Research In Principles Of Data Science, Foundations of Data Science].\",\"PeriodicalId\":36337,\"journal\":{\"name\":\"Stochastic Systems\",\"volume\":\"14 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Stochastic Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1287/stsy.2022.0003\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Mathematics\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Stochastic Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1287/stsy.2022.0003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Mathematics","Score":null,"Total":0}
Citations: 0

Abstract

In this paper, we consider the widely studied problem of empirical risk minimization (ERM) of strongly convex and smooth loss functions using iterative gradient-based methods. A major goal of the existing literature has been to compare different prototypical algorithms, such as batch gradient descent (GD) or stochastic gradient descent (SGD), by analyzing their rates of convergence to ϵ-approximate solutions with respect to the number of gradient computations, which is also known as the oracle complexity. For example, the oracle complexity of GD is [Formula: see text], where n is the number of training samples and p is the parameter space dimension. When n is large, this can be prohibitively expensive in practice, and SGD is preferred due to its oracle complexity of [Formula: see text]. Such standard analyses only utilize the smoothness of the loss function in the parameter being optimized. In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of GD, SGD, and their variants in important regimes. Specifically, at every iteration, our proposed algorithm, Local Polynomial Interpolation-based Gradient Descent (LPI-GD), first performs local polynomial regression with a virtual batch of data points to learn the gradient of the loss function and then estimates the true gradient of the ERM objective function. We establish that the oracle complexity of LPI-GD is [Formula: see text], where d is the data space dimension, and the gradient of the loss function is assumed to belong to an η-Hölder class with respect to the data. Our proof extends the analysis of local polynomial regression in nonparametric statistics to provide supremum norm guarantees for interpolation in multivariate settings and also exploits tools from the inexact GD literature. Unlike the complexities of GD and SGD, the complexity of our method depends on d. However, our algorithm outperforms GD, SGD, and their variants in oracle complexity for a broad range of settings where d is small relative to n. For example, with typical loss functions (such as squared or cross-entropy loss), when [Formula: see text] for any [Formula: see text] and [Formula: see text] is at the statistical limit, our method can be made to require [Formula: see text] oracle calls for any [Formula: see text], while SGD and GD require [Formula: see text] and [Formula: see text] oracle calls, respectively.

Funding: This work was supported in part by the Office of Naval Research [Grant N000142012394], in part by the Army Research Office [Multidisciplinary University Research Initiative Grant W911NF-19-1-0217], and in part by the National Science Foundation [Transdisciplinary Research In Principles Of Data Science, Foundations of Data Science].
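To make the per-iteration idea described in the abstract concrete, below is a minimal, illustrative sketch of an LPI-GD-style update, not the authors' implementation. It uses local linear regression (the degree-1 case of local polynomial regression) with Gaussian kernel weights to learn the per-sample gradient map from a small virtual batch, averages the fitted gradients over all n training points to form an inexact ERM gradient, and then takes an ordinary gradient-descent step. The squared-loss problem and the choices of batch size m, bandwidth h, ridge term lam, and step size are hypothetical and made only for this demonstration.

```python
# A minimal sketch of an LPI-GD-style iteration, assuming a squared-loss ERM problem.
# All tuning constants below are arbitrary demo choices, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ERM problem: f(theta) = (1/n) sum_i (x_i^T theta - y_i)^2 / 2.
n, d = 2000, 2                      # n training samples, data-space dimension d
p = d                               # parameter dimension (linear model)
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=p)
y = X @ theta_star + 0.1 * rng.normal(size=n)

def sample_gradient(theta, x, yi):
    """Gradient oracle: gradient of the per-sample loss at (x, yi); smooth in the data."""
    return (x @ theta - yi) * x

def lpi_gradient_estimate(theta, m=64, h=1.0, lam=1e-6):
    """Estimate the full ERM gradient from only m oracle calls (vs. n for exact GD).

    1) Query the gradient oracle on a virtual batch of m data points.
    2) Fit a kernel-weighted local linear model of the gradient around each
       training point (Gaussian weights, bandwidth h, small ridge term lam).
    3) Average the fitted gradients over all n training points.
    """
    idx = rng.choice(n, size=m, replace=False)                        # virtual batch
    G = np.stack([sample_gradient(theta, X[i], y[i]) for i in idx])   # m oracle calls
    Xb = X[idx]

    grad_sum = np.zeros(p)
    for x0 in X:                                         # predict the gradient at every sample
        w = np.exp(-np.sum((Xb - x0) ** 2, axis=1) / (2 * h ** 2))
        Z = np.hstack([np.ones((m, 1)), Xb - x0])        # local linear design, centered at x0
        A = Z.T @ (w[:, None] * Z) + lam * np.eye(d + 1)
        coef = np.linalg.solve(A, Z.T @ (w[:, None] * G))
        grad_sum += coef[0]                              # intercept row = fitted gradient at x0
    return grad_sum / n

# Inexact gradient descent driven by the locally regressed gradient estimates.
theta = np.zeros(p)
for t in range(50):
    theta -= 0.1 * lpi_gradient_estimate(theta)

print("estimation error:", np.linalg.norm(theta - theta_star))
```

In the algorithm analyzed in the paper, the virtual batch and the order of the local polynomial are tied to the Hölder smoothness η, the data dimension d, and the target accuracy; here the degree, m, and h are fixed constants purely for illustration of the two-step structure (learn the gradient map, then take an inexact GD step).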
Source journal: Stochastic Systems (Decision Sciences-Statistics, Probability and Uncertainty)
CiteScore: 3.70
Self-citation rate: 0.00%
Articles published: 18