Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

International Conference on Algorithmic Learning Theory. Pub Date: 2022-10-10. DOI: 10.48550/arXiv.2210.04946

Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, A. Lazaric


Citations: 1

Abstract

We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min}=0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this result with lower bounds when prior knowledge of the hitting time of the optimal policy is available and when we restrict optimality by competing against policies with bounded hitting time. Finally, we design an algorithm with matching upper bounds in these cases. This settles the sample complexity of learning $\epsilon$-optimal policies in SSP with generative models. We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general. On the other hand, efficient learning becomes possible if we assume the agent can directly reach the goal state from any state by paying a fixed cost. We then establish the first upper and lower bounds under this assumption. Finally, using similar analytic tools, we prove that horizon-free regret is impossible in SSPs under general costs, resolving an open problem from Tarbouriech et al. (2021c).
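
To get a feel for the lower bound stated in the abstract, the following minimal sketch evaluates the scaling term $SAB_{\star}^3/(c_{\min}\epsilon^2)$ numerically. It is an illustration only: the function name `ssp_lower_bound_scale` is hypothetical, the constants and logarithmic factors hidden by the $\Omega(\cdot)$ notation are dropped, and treating $c_{\min}=0$ as an infinite value is a convention assumed here to mirror the "may not be learnable" statement, not a formula taken from the paper.

```python
import math


def ssp_lower_bound_scale(S, A, B_star, c_min, eps):
    """Return S * A * B_star^3 / (c_min * eps^2).

    This is only the order of the lower bound from the abstract;
    constants and log factors hidden by Omega(.) are ignored.
    """
    if S <= 0 or A <= 0 or B_star <= 0 or eps <= 0 or c_min < 0:
        raise ValueError("S, A, B_star, eps must be positive and c_min non-negative")
    if c_min == 0:
        # Assumed convention: the bound diverges when the minimum cost is zero.
        return math.inf
    return S * A * B_star**3 / (c_min * eps**2)


if __name__ == "__main__":
    # Shrinking c_min towards zero inflates the required sample count.
    for c_min in (1.0, 0.1, 0.01, 0.0):
        scale = ssp_lower_bound_scale(S=10, A=2, B_star=5.0, c_min=c_min, eps=0.1)
        print(f"c_min={c_min}: scale={scale}")
```

The cubic dependence on $B_{\star}$ and the inverse dependence on $c_{\min}$ are the two features the abstract highlights: halving $\epsilon$ multiplies the term by four, while letting $c_{\min}$ approach zero makes it grow without bound.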

