Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, Wen Sun
{"title":"Corruption-Robust Exploration in Episodic Reinforcement Learning","authors":"Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, Wen Sun","doi":"10.1287/moor.2021.0202","DOIUrl":"https://doi.org/10.1287/moor.2021.0202","url":null,"abstract":"We initiate the study of episodic reinforcement learning (RL) under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of multiarmed bandits. We provide a framework that modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on optimism in the face of uncertainty by complementing them with principles from action elimination. Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms that (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees that degrade gracefully in the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) and linear Markov decision process settings (where the dynamics and rewards admit a linear underlying representation). 
Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely independent and identically distributed transitions in the bandit-feedback model for episodic reinforcement learning. Supplemental Material: The online appendix is available at https://doi.org/10.1287/moor.2021.0202 .","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"61 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141146375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Markov Decision Processes with Observation Costs: Framework and Computation with a Penalty Scheme","authors":"Christoph Reisinger, Jonathan Tam","doi":"10.1287/moor.2023.0172","DOIUrl":"https://doi.org/10.1287/moor.2023.0172","url":null,"abstract":"We consider Markov decision processes where the state of the chain is only given at chosen observation times and at a cost. Optimal strategies involve the optimization of observation times as well as the subsequent action values. We consider the finite horizon and discounted infinite horizon problems as well as an extension with parameter uncertainty. By including the time elapsed from observations as part of the augmented Markov system, the value function satisfies a system of quasivariational inequalities (QVIs). Such a class of QVIs can be seen as an extension to the interconnected obstacle problem. We prove a comparison principle for this class of QVIs, which implies the uniqueness of solutions to our proposed problem. Penalty methods are then utilized to obtain arbitrarily accurate solutions. Finally, we perform numerical experiments on three applications that illustrate our framework. Funding: J. Tam is supported by the Engineering and Physical Sciences Research Council [Grant 2269738].","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"26 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141146332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Liquid Welfare Guarantees for No-Regret Learning in Sequential Budgeted Auctions","authors":"Giannis Fikioris, Éva Tardos","doi":"10.1287/moor.2023.0274","DOIUrl":"https://doi.org/10.1287/moor.2023.0274","url":null,"abstract":"We study the liquid welfare in sequential first-price auctions with budgeted buyers. We use a behavioral model for the buyers, assuming a learning style guarantee: the utility of each buyer is within a [Formula: see text] factor ([Formula: see text]) of the utility achievable by shading their value with the same factor at each iteration. We show a [Formula: see text] price of anarchy for liquid welfare when valuations are additive. This is in stark contrast to sequential second-price auctions, where the resulting liquid welfare can be arbitrarily smaller than the maximum liquid welfare, even when [Formula: see text]. We prove a lower bound of [Formula: see text] on the liquid welfare loss under the given assumption in first-price auctions. Our liquid welfare results extend when buyers have submodular valuations over the set of items they win across iterations with a slightly worse price of anarchy bound of [Formula: see text] compared with the guarantee for the additive case. Funding: G. Fikioris is supported in part by the Air Force Office of Scientific Research [Grants FA9550-19-1-0183 and FA9550-23-1-0068], the Department of Defense (DoD) through the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program, and the Onassis Foundation [Scholarship ID F ZS 068-1/2022-2023]. É. 
Tardos is supported in part by the NSF [Grant CCF-1408673] and AFOSR [Grants FA9550-19-1-0183, FA9550-23-1-0410, and FA9550-23-1-0068].","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"32 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141146378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Theory of Alternating Paths and Blossoms from the Perspective of Minimum Length","authors":"Vijay V. Vazirani","doi":"10.1287/moor.2020.0388","DOIUrl":"https://doi.org/10.1287/moor.2020.0388","url":null,"abstract":"The Micali–Vazirani (MV) algorithm for finding a maximum cardinality matching in general graphs, which was published in 1980, remains to this day the most efficient known algorithm for the problem. The current paper gives the first complete and correct proof of this algorithm. The MV algorithm resorts to finding minimum-length augmenting paths. However, such paths fail to satisfy an elementary property, called breadth first search honesty in this paper. In the absence of this property, an exponential time algorithm appears to be called for—just for finding one such path. On the other hand, the MV algorithm accomplishes this and additional tasks in linear time. The saving grace is the various “footholds” offered by the underlying structure, which the algorithm uses in order to perform its key tasks efficiently. The theory expounded in this paper elucidates this rich structure and yields a proof of correctness of the algorithm. It may also be of independent interest as a set of well-knit graph-theoretic facts. Funding: This work was supported in part by the National Science Foundation [Grant CCF-2230414].","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"41 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140933967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is There a Golden Parachute in Sannikov’s Principal–Agent Problem?","authors":"Dylan Possamaï, Nizar Touzi","doi":"10.1287/moor.2022.0305","DOIUrl":"https://doi.org/10.1287/moor.2022.0305","url":null,"abstract":"This paper provides a complete review of the continuous-time optimal contracting problem introduced by Sannikov in the extended context allowing for possibly different discount rates for both parties. The agent’s problem is to seek the optimal effort given the compensation scheme proposed by the principal over a random horizon. Then, given the optimal agent’s response, the principal determines the best compensation scheme in terms of running payment, retirement, and lump-sum payment at retirement. A golden parachute is a situation where the agent ceases any effort at some positive stopping time and receives a payment afterward, possibly in the form of a lump-sum payment or of a continuous stream of payments. We show that a golden parachute only exists in certain specific circumstances. This is in contrast with the results claimed by Sannikov, where the only requirement is a positive agent’s marginal cost of effort at zero. In the general case, we prove that an agent with positive reservation utility is either never retired by the principal or retired above some given threshold (as in Sannikov’s solution). We show that different discount factors induce a facelifted utility function, which allows us to reduce the analysis to a setting similar to the equal-discount rates one. 
Finally, we also confirm that an agent with small reservation utility does have an informational rent, meaning that the principal optimally offers him a contract with strictly higher utility than his participation value.","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"33 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140887228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diego Goldsztajn, Sem C. Borst, Johan S. H. van Leeuwaarden
{"title":"Learning and Balancing Unknown Loads in Large-Scale Systems","authors":"Diego Goldsztajn, Sem C. Borst, Johan S. H. van Leeuwaarden","doi":"10.1287/moor.2021.0212","DOIUrl":"https://doi.org/10.1287/moor.2021.0212","url":null,"abstract":"Consider a system of identical server pools where tasks with exponentially distributed service times arrive as a time-inhomogeneous Poisson process. An admission threshold is used in an inner control loop to assign incoming tasks to server pools, while in an outer control loop, a learning scheme adjusts this threshold over time to keep it aligned with the unknown offered load of the system. In a many-server regime, we prove that the learning scheme reaches an equilibrium along intervals of time when the normalized offered load per server pool is suitably bounded and that this results in a balanced distribution of the load. Furthermore, we establish a similar result when tasks with Coxian distributed service times arrive at a constant rate and the threshold is adjusted using only the total number of tasks in the system. The novel proof technique developed in this paper, which differs from a traditional fluid limit analysis, allows us to handle rapid variations of the first learning scheme, triggered by excursions of the occupancy process that have vanishing size. 
Moreover, our approach allows us to characterize the asymptotic behavior of the system with Coxian distributed service times without relying on a fluid limit of a detailed state descriptor. Funding: The work in this paper was supported by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek [Gravitation Grant NETWORKS-024.002.003 and Vici Grant 202.068].","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"18 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140832251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating a Function and Its Derivatives Under a Smoothness Condition","authors":"Eunji Lim","doi":"10.1287/moor.2020.0161","DOIUrl":"https://doi.org/10.1287/moor.2020.0161","url":null,"abstract":"We consider the problem of estimating an unknown function [Formula: see text] and its partial derivatives from a noisy data set of n observations, where we make no assumptions about [Formula: see text] except that it is smooth in the sense that it has square integrable partial derivatives of order m. A natural candidate for the estimator of [Formula: see text] in such a case is the best fit to the data set that satisfies a certain smoothness condition. This estimator can be seen as a least squares estimator subject to an upper bound on some measure of smoothness. Another useful estimator is the one that minimizes the degree of smoothness subject to an upper bound on the average of squared errors. We prove that these two estimators are computable as solutions to quadratic programs, establish the consistency of these estimators and their partial derivatives, and study the convergence rate as [Formula: see text]. The effectiveness of the estimators is illustrated numerically in a setting where the value of a stock option and its second derivative are estimated as functions of the underlying stock price.","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"40 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correlated Equilibria for Mean Field Games with Progressive Strategies","authors":"Ofelia Bonesini, Luciano Campi, Markus Fischer","doi":"10.1287/moor.2022.0357","DOIUrl":"https://doi.org/10.1287/moor.2022.0357","url":null,"abstract":"In a discrete space and time framework, we study the mean field game limit for a class of symmetric N-player games based on the notion of correlated equilibrium. We give a definition of correlated solution that allows us to construct approximate N-player correlated equilibria that are robust with respect to progressive deviations. We illustrate our definition by way of an example with explicit solutions. Funding: O. Bonesini acknowledges financial support from Engineering and Physical Sciences Research Council [Grant EP/T032146/1]. M. Fischer acknowledges partial support through the University of Padua [Research Project BIRD229791 “Stochastic mean field control and the Schrödinger problem”].","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"54 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140832372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convexification of Bilinear Terms over Network Polytopes","authors":"Erfan Khademnia, Danial Davarnia","doi":"10.1287/moor.2023.0001","DOIUrl":"https://doi.org/10.1287/moor.2023.0001","url":null,"abstract":"It is well-known that the McCormick relaxation for the bilinear constraint z = xy gives the convex hull over the box domains for x and y. In network applications where the domain of bilinear variables is described by a network polytope, the McCormick relaxation, also referred to as linearization, fails to provide the convex hull and often leads to poor dual bounds. We study the convex hull of the set containing bilinear constraints [Formula: see text] where x_i represents the arc-flow variable in a network polytope, and y_j is in a simplex. For the case where the simplex contains a single y variable, we introduce a systematic procedure to obtain the convex hull of the above set in the original space of variables, and show that all facet-defining inequalities of the convex hull can be obtained explicitly through identifying a special tree structure in the underlying network. For the generalization where the simplex contains multiple y variables, we design a constructive procedure to obtain an important class of facet-defining inequalities for the convex hull of the underlying bilinear set that is characterized by a special forest structure in the underlying network. 
Computational experiments conducted on different applications show the effectiveness of the proposed methods in improving the dual bounds obtained from alternative techniques. Funding: This work was supported by Air Force Office of Scientific Research [Grant FA9550-23-1-0183]; National Science Foundation, Division of Civil, Mechanical and Manufacturing Innovation [Grant 2338641]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/moor.2023.0001 .","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"10 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140798174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov
{"title":"Finite-Time High-Probability Bounds for Polyak–Ruppert Averaged Iterates of Linear Stochastic Approximation","authors":"Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov","doi":"10.1287/moor.2022.0179","DOIUrl":"https://doi.org/10.1287/moor.2022.0179","url":null,"abstract":"This paper provides a finite-time analysis of linear stochastic approximation (LSA) algorithms with fixed step size, a core method in statistics and machine learning. LSA is used to compute approximate solutions of a d-dimensional linear system [Formula: see text] for which [Formula: see text] can only be estimated by (asymptotically) unbiased observations [Formula: see text]. We consider here the case where [Formula: see text] is either a sequence of independent and identically distributed random variables or a uniformly geometrically ergodic Markov chain. We derive pth moment and high-probability deviation bounds for the iterates defined by LSA and its Polyak–Ruppert-averaged version. Our finite-time instance-dependent bounds for the averaged LSA iterates are sharp in the sense that the leading term we obtain coincides with the local asymptotic minimax limit. Moreover, the remainder terms of our bounds admit a tight dependence on the mixing time [Formula: see text] of the underlying chain and the norm of the noise variables. We emphasize that our result requires the LSA step size to scale only with the logarithm of the problem dimension d. Funding: The work of A. Durmus and E. Moulines was partly supported by [Grant ANR-19-CHIA-0002]. This project received funding from the European Research Council [ERC-SyG OCEAN Grant 101071601]. The research of A. Naumov and S. 
Samsonov was prepared within the framework of the HSE University Basic Research Program.","PeriodicalId":49852,"journal":{"name":"Mathematics of Operations Research","volume":"185 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140612440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}