Reduced variance deep reinforcement learning with temporal logic specifications

Qitong Gao, Davood Hajinezhad, Yan Zhang, Y. Kantaros, M. Zavlanos
{"title":"基于时间逻辑规范的减少方差深度强化学习","authors":"Qitong Gao, Davood Hajinezhad, Yan Zhang, Y. Kantaros, M. Zavlanos","doi":"10.1145/3302509.3311053","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a model-free reinforcement learning method to synthesize control policies for mobile robots modeled as Markov Decision Process (MDP) with unknown transition probabilities that satisfy Linear Temporal Logic (LTL) specifications. Specifically, we develop a reduced variance deep Q-Learning technique that relies on Neural Networks (NN) to approximate the state-action values of the MDP and employs a reward function that depends on the accepting condition of the Deterministic Rabin Automaton (DRA) that captures the LTL specification. The key idea is to convert the deep Q-Learning problem into a nonconvex max-min optimization problem with a finite-sum structure, and develop an Arrow-Hurwicz-Uzawa type stochastic reduced variance algorithm with constant stepsize to solve it. Unlike Stochastic Gradient Descent (SGD) methods that are often used in deep reinforcement learning, our method can estimate the gradients of an unknown loss function more accurately and can improve the stability of the training process. Moreover, our method does not require learning the transition probabilities in the MDP, constructing a product MDP, or computing Accepting Maximal End Components (AMECs). This allows the robot to learn an optimal policy even if the environment cannot be modeled accurately or if AMECs do not exist. In the latter case, the resulting control policies minimize the frequency with which the system enters bad states in the DRA that violate the task specifications. To the best of our knowledge, this is the first model-free deep reinforcement learning algorithm that can synthesize policies that maximize the probability of satisfying an LTL specification even if AMECs do not exist. Rigorous convergence analysis and rate of convergence are provided for the proposed algorithm as well as numerical experiments that validate our method.","PeriodicalId":413733,"journal":{"name":"Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems","volume":"2 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"48","resultStr":"{\"title\":\"Reduced variance deep reinforcement learning with temporal logic specifications\",\"authors\":\"Qitong Gao, Davood Hajinezhad, Yan Zhang, Y. Kantaros, M. Zavlanos\",\"doi\":\"10.1145/3302509.3311053\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a model-free reinforcement learning method to synthesize control policies for mobile robots modeled as Markov Decision Process (MDP) with unknown transition probabilities that satisfy Linear Temporal Logic (LTL) specifications. Specifically, we develop a reduced variance deep Q-Learning technique that relies on Neural Networks (NN) to approximate the state-action values of the MDP and employs a reward function that depends on the accepting condition of the Deterministic Rabin Automaton (DRA) that captures the LTL specification. The key idea is to convert the deep Q-Learning problem into a nonconvex max-min optimization problem with a finite-sum structure, and develop an Arrow-Hurwicz-Uzawa type stochastic reduced variance algorithm with constant stepsize to solve it. 
Unlike Stochastic Gradient Descent (SGD) methods that are often used in deep reinforcement learning, our method can estimate the gradients of an unknown loss function more accurately and can improve the stability of the training process. Moreover, our method does not require learning the transition probabilities in the MDP, constructing a product MDP, or computing Accepting Maximal End Components (AMECs). This allows the robot to learn an optimal policy even if the environment cannot be modeled accurately or if AMECs do not exist. In the latter case, the resulting control policies minimize the frequency with which the system enters bad states in the DRA that violate the task specifications. To the best of our knowledge, this is the first model-free deep reinforcement learning algorithm that can synthesize policies that maximize the probability of satisfying an LTL specification even if AMECs do not exist. Rigorous convergence analysis and rate of convergence are provided for the proposed algorithm as well as numerical experiments that validate our method.\",\"PeriodicalId\":413733,\"journal\":{\"name\":\"Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems\",\"volume\":\"2 3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"48\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3302509.3311053\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3302509.3311053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 48

Abstract

In this paper, we propose a model-free reinforcement learning method to synthesize control policies for mobile robots modeled as Markov Decision Process (MDP) with unknown transition probabilities that satisfy Linear Temporal Logic (LTL) specifications. Specifically, we develop a reduced variance deep Q-Learning technique that relies on Neural Networks (NN) to approximate the state-action values of the MDP and employs a reward function that depends on the accepting condition of the Deterministic Rabin Automaton (DRA) that captures the LTL specification. The key idea is to convert the deep Q-Learning problem into a nonconvex max-min optimization problem with a finite-sum structure, and develop an Arrow-Hurwicz-Uzawa type stochastic reduced variance algorithm with constant stepsize to solve it. Unlike Stochastic Gradient Descent (SGD) methods that are often used in deep reinforcement learning, our method can estimate the gradients of an unknown loss function more accurately and can improve the stability of the training process. Moreover, our method does not require learning the transition probabilities in the MDP, constructing a product MDP, or computing Accepting Maximal End Components (AMECs). This allows the robot to learn an optimal policy even if the environment cannot be modeled accurately or if AMECs do not exist. In the latter case, the resulting control policies minimize the frequency with which the system enters bad states in the DRA that violate the task specifications. To the best of our knowledge, this is the first model-free deep reinforcement learning algorithm that can synthesize policies that maximize the probability of satisfying an LTL specification even if AMECs do not exist. Rigorous convergence analysis and rate of convergence are provided for the proposed algorithm as well as numerical experiments that validate our method.
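
The abstract's central computational idea is to exploit the finite-sum structure of the Q-fitting loss so that a variance-reduced stochastic gradient can be used with a constant step size. As a rough illustration only, and not the authors' algorithm, the sketch below applies an SVRG-style estimator to a toy fixed-target Q-fitting loss with linear function approximation on synthetic data; the neural network, the DRA-based reward, and the Arrow-Hurwicz-Uzawa max-min treatment are all omitted, and every name in the code is hypothetical.

```python
# Illustrative sketch only -- NOT the paper's Arrow-Hurwicz-Uzawa primal-dual
# algorithm. It shows, in their simplest form, the two ingredients the abstract
# refers to: (i) a Q-fitting loss with a finite-sum structure, and (ii) an
# SVRG-style variance-reduced gradient estimator used with a constant step size.
# Linear features and synthetic data stand in for the neural network and the
# DRA-based reward; every name below is hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of N transitions (s_i, a_i, r_i, s'_i) described by d-dim features.
N, d, n_actions, gamma = 256, 8, 4, 0.95
phi_sa = rng.normal(size=(N, d))               # phi(s_i, a_i)
phi_next = rng.normal(size=(N, n_actions, d))  # phi(s'_i, a) for every action a
rewards = rng.normal(size=N)

# Fixed-target Bellman backups y_i = r_i + gamma * max_a w_target^T phi(s'_i, a),
# frozen for the whole run (the usual target-network trick, here with a fixed
# random target parameter so the example stays a plain finite-sum regression).
w_target = rng.normal(size=d)
y = rewards + gamma * np.max(phi_next @ w_target, axis=1)

def grad_i(w, i):
    """Gradient of the i-th summand of L(w) = (1/N) sum_i 0.5*(phi_i^T w - y_i)^2."""
    return (phi_sa[i] @ w - y[i]) * phi_sa[i]

def full_grad(w):
    return phi_sa.T @ (phi_sa @ w - y) / N

# SVRG-style loop with a constant step size: each inner step uses
#   g = grad_i(w) - grad_i(w_ref) + full_grad(w_ref),
# an unbiased gradient estimate whose variance shrinks as w approaches w_ref,
# so no decaying step size is needed.
w = np.zeros(d)
step, epochs, inner_steps = 0.01, 15, N
for epoch in range(epochs):
    w_ref = w.copy()
    mu = full_grad(w_ref)               # full gradient at the reference point
    for _ in range(inner_steps):
        i = rng.integers(N)
        w -= step * (grad_i(w, i) - grad_i(w_ref, i) + mu)
    print(f"epoch {epoch:2d}  loss = {0.5 * np.mean((phi_sa @ w - y) ** 2):.4f}")
```

The defining feature is the corrected stochastic gradient: subtracting grad_i(w_ref) and adding back the full gradient at w_ref keeps the estimator unbiased while removing most of its variance, which is what permits the constant step size the abstract mentions. The paper applies this variance-reduction principle to a nonconvex max-min reformulation of the neural-network Q-fitting problem, with the reward derived from the DRA accepting condition.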