Feasible Policy Iteration With Guaranteed Safe Exploration

IF 9.4 | Zone 1, Computer Science | Q1, AUTOMATION & CONTROL SYSTEMS

IEEE Transactions on Cybernetics Pub Date : 2025-03-18 DOI:10.1109/TCYB.2025.3542223

Yuhang Zhang;Yujie Yang;Shengbo Eben Li;Yao Lyu;Jingliang Duan;Zhilong Zheng;Dezhao Zhang

IEEE Transactions on Cybernetics, vol. 55, no. 5, pp. 2327-2340. https://ieeexplore.ieee.org/document/10931147/

Abstract

Guaranteeing safety is essential when reinforcement learning (RL) is used to train real-world tasks. During online exploration of the environment, any constraint violation can cause significant property damage and put personnel at risk. Existing safe RL methods either address safety only after reaching optimality or tolerate some degree of constraint violation during training. This article proposes a feasible policy iteration framework that guarantees absolute safety during online exploration, i.e., constraint violations never happen in real-world interactions. The key to maintaining absolute safety is to confine exploration at every step to the feasible region of the current policy. This feasible region is described by a newly defined constraint decay function with uncertainty, which ensures forward invariance of the feasible region in the worst case. Within the proposed framework, the feasible region expands monotonically and converges to its maximum extent even though only local samples are available, i.e., the agent only has access to samples within the feasible region. Meanwhile, the trained policy also improves monotonically within its corresponding feasible region, provided that different updating rules can be used inside and outside the feasible region. Finally, practical algorithms are designed with the actor-critic-scenery architecture and consist of three modules: 1) safe exploration; 2) model error estimation; and 3) network update. Experimental results indicate that the algorithms achieve performance comparable to baselines while maintaining zero constraint violations throughout the entire training process. In contrast, the baseline algorithms typically incur thousands of constraint violations to reach the same performance. These findings suggest substantial potential for applying feasible policy iteration in real-world tasks, enabling the online evolution of intricate systems.
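The idea of confining exploration to a forward-invariant feasible region under worst-case model error can be illustrated with a toy sketch. This is not the algorithm from the article: the 1-D dynamics, the interval-shaped feasible region `R_FEAS`, the error bound `EPS`, the proportional fallback policy, and all function names are assumptions chosen only so that the invariance argument is easy to check by hand.

```python
import random

X_MAX = 1.0   # hard constraint (assumed): |x| <= X_MAX must never be violated
R_FEAS = 0.9  # assumed feasible region of the current policy: |x| <= R_FEAS
DT = 0.1      # step size of the assumed 1-D integrator
EPS = 0.05    # assumed bound on the one-step model error
K = 1.0       # gain of the assumed fallback (current) policy


def model_step(x, a):
    """Nominal model prediction (hypothetical dynamics)."""
    return x + DT * a


def true_step(x, a, rng):
    """Real system: nominal model plus a bounded, unknown error."""
    return model_step(x, a) + rng.uniform(-EPS, EPS)


def fallback_policy(x):
    """Current policy; keeps |x| <= R_FEAS because EPS <= K * DT * R_FEAS."""
    return -K * x


def stays_feasible_worst_case(x_pred):
    """True if the next state is in the feasible region for every error."""
    return abs(x_pred) + EPS <= R_FEAS


def safe_explore_action(x, rng):
    """Try a random exploratory action; fall back if it might leave the region."""
    a = rng.uniform(-2.0, 2.0)
    if stays_feasible_worst_case(model_step(x, a)):
        return a
    return fallback_policy(x)


def rollout(steps=1000, seed=0):
    """Run one exploration episode; return (violation count, max |x| seen)."""
    rng = random.Random(seed)
    x, violations, max_abs = 0.0, 0, 0.0
    for _ in range(steps):
        a = safe_explore_action(x, rng)
        x = true_step(x, a, rng)
        max_abs = max(max_abs, abs(x))
        violations += int(abs(x) > X_MAX)
    return violations, max_abs


if __name__ == "__main__":
    print(rollout())
```

In this sketch an exploratory action is accepted only if the predicted next state stays inside the feasible region for every admissible model error; otherwise the fallback policy, which keeps the region invariant by construction, is applied. The learned feasibility function, the model error estimation, and the network updates described in the abstract are replaced here by fixed constants.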

Source journal

IEEE Transactions on Cybernetics (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, CYBERNETICS)

CiteScore: 25.40

Self-citation rate: 11.00%

Articles published: 1869

About the journal: The scope of the IEEE Transactions on Cybernetics includes computational approaches to the field of cybernetics. Specifically, the Transactions welcomes papers on communication and control across machines, or among machines, humans, and organizations. The scope includes such areas as computational intelligence, computer vision, neural networks, genetic algorithms, machine learning, fuzzy systems, cognitive systems, decision making, and robotics, to the extent that they contribute to the theme of cybernetics or demonstrate an application of cybernetics principles.