A Sublinear-Regret Reinforcement Learning Algorithm on Constrained Markov Decision Processes with reset action

Proceedings of the 4th International Conference on Machine Learning and Soft Computing. Pub Date: 2020-01-17. DOI: 10.1145/3380688.3380706

Takashi Watanabe, T. Sakuragawa

Abstract

In this paper, we study model-based reinforcement learning in unknown constrained Markov decision processes (CMDPs) with a reset action. We propose an algorithm, Constrained-UCRL, which uses confidence intervals as in UCRL2 and solves a linear program to compute a policy at the start of each episode. We show that, with high probability, Constrained-UCRL achieves sublinear regret bounds of $\tilde{O}(S A^{1/2} T^{3/4})$, up to logarithmic factors, for both the gain and the constraint violations.
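The abstract does not spell out the confidence intervals it borrows from UCRL2. As a rough sketch only, the snippet below computes confidence radii of the form used in the original UCRL2 analysis: an L1 radius around the empirical transition vector and a radius around the empirical mean reward of a state-action pair. The function names, the parameter names, and the constants 14 and 7 are assumptions carried over from UCRL2; Constrained-UCRL may use different constants or additional intervals for the constraint costs.

```python
import math


def ucrl2_transition_radius(n_visits, n_states, n_actions, t, delta):
    """L1 radius around the empirical transition vector of one (state, action) pair.

    Hypothetical helper using the UCRL2-style form
    sqrt(14 * S * log(2 * A * t / delta) / max(1, N)); requires t >= 1.
    """
    return math.sqrt(
        14 * n_states * math.log(2 * n_actions * t / delta) / max(1, n_visits)
    )


def ucrl2_reward_radius(n_visits, n_states, n_actions, t, delta):
    """Radius around the empirical mean reward of one (state, action) pair.

    Hypothetical helper using the UCRL2-style form
    sqrt(7 * log(2 * S * A * t / delta) / (2 * max(1, N))); requires t >= 1.
    """
    return math.sqrt(
        7 * math.log(2 * n_states * n_actions * t / delta) / (2 * max(1, n_visits))
    )
```

Both radii shrink as 1/sqrt(N) in the visit count N, so quadrupling the number of visits to a state-action pair halves its interval. An unvisited pair is treated like a pair visited once, which keeps the radius finite.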
