{"title":"Toward Universal Controller: Performance-Aware Self-Optimizing Reinforcement Learning for Discrete-Time Systems With Uncontrollable Factors","authors":"Jianfeng Zhang;Haoran Zhang;Chunhui Zhao","doi":"10.1109/TSMC.2025.3539349","DOIUrl":null,"url":null,"abstract":"The industrial system usually contains not only controllable variables (CVs) but also uncontrollable variables (unCVs), e.g., weather conditions and friction. These unCVs have a direct impact on system control performance. Despite the success of current deep reinforcement learning (DRL) control algorithms, most of them neglect the impact of unCVs, which can cause the deterioration of control performance and instability of the system. To perceive and eliminate the impact of unCVs, a performance-aware self-optimizing universal controller (PASOUC) is designed in this article. The PASOUC aims at integrating the representation of unCVs and controller design to perceive and eliminate the impact of unCVs under different conditions, which goes beyond most existing control methods. Technically, a historical trajectory-inspired control performance perceptron is developed to perceive the impact of unCVs on system control performance under different conditions. Subsequently, a new performance-aware reward is designed to integrate the representation of unCVs and controller design while training the DRL controller. In addition, the domain randomization (DR) training strategy is employed to learn a universal control policy, which can access the approximate optimal trajectory under nonideal conditions. In this way, the impact of unCVs can be eliminated. To handle the low efficiency of the DR training, the policy improvement-policy proximal optimization (PI-PPO) is proposed to enhance the convergence speed of the DR training by performing explicit policy improvement. Finally, illustrative examples are presented to demonstrate the superiority of the proposed method.","PeriodicalId":48915,"journal":{"name":"IEEE Transactions on Systems Man Cybernetics-Systems","volume":"55 5","pages":"3249-3260"},"PeriodicalIF":8.6000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Systems Man Cybernetics-Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10896858/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Industrial systems usually contain not only controllable variables (CVs) but also uncontrollable variables (unCVs), e.g., weather conditions and friction, which directly affect control performance. Despite the success of current deep reinforcement learning (DRL) control algorithms, most of them neglect the impact of unCVs, which can degrade control performance and destabilize the system. To perceive and eliminate this impact, a performance-aware self-optimizing universal controller (PASOUC) is designed in this article. Going beyond most existing control methods, the PASOUC integrates the representation of unCVs with controller design to perceive and eliminate the impact of unCVs under different conditions. Technically, a historical-trajectory-inspired control performance perceptron is developed to perceive the impact of unCVs on system control performance under different conditions. A new performance-aware reward is then designed to couple the representation of unCVs with controller design while training the DRL controller. In addition, a domain randomization (DR) training strategy is employed to learn a universal control policy that can attain an approximately optimal trajectory under nonideal conditions; in this way, the impact of unCVs can be eliminated. To address the low efficiency of DR training, policy improvement-proximal policy optimization (PI-PPO) is proposed, which accelerates the convergence of DR training by performing explicit policy improvement. Finally, illustrative examples are presented to demonstrate the superiority of the proposed method.
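To make the training recipe named in the abstract concrete, below is a minimal Python sketch of two of its ingredients: domain randomization over an uncontrollable factor, and a performance-aware reward that penalizes deviation from a nominal (ideal-condition) trajectory, standing in for the historical-trajectory-based performance perceptron. The toy plant, the disturbance range, the penalty weight beta, and the hill-climbing update are all illustrative assumptions; in particular, the hill climbing is a placeholder for the paper's PI-PPO update, which is not reproduced here.

import numpy as np

rng = np.random.default_rng(0)

# Toy discrete-time plant with an uncontrollable factor (unCV):
# x[k+1] = 0.9*x[k] + 0.5*u[k] + d, where d models an unCV (e.g., a
# friction-like disturbance). All names and dynamics are illustrative.

def rollout(policy_gain, d, steps=50):
    """Roll out a linear state-feedback policy u = -K*x; return the state trajectory."""
    x, traj = 1.0, []
    for _ in range(steps):
        u = -policy_gain * x
        x = 0.9 * x + 0.5 * u + d
        traj.append(x)
    return np.array(traj)

# Nominal reference trajectory under ideal conditions (d = 0), standing in
# for the historical trajectory used by the performance perceptron.
reference = rollout(policy_gain=1.0, d=0.0)

def performance_aware_reward(traj, beta=0.5):
    """Task reward (drive the state to zero) minus a perceived-performance
    penalty: deviation from the nominal trajectory attributable to unCVs."""
    tracking_cost = np.mean(traj ** 2)
    perceived_degradation = np.mean((traj - reference) ** 2)
    return -tracking_cost - beta * perceived_degradation

# Domain-randomized policy search (simple hill climbing as a stand-in for
# the paper's PI-PPO; the actual update rule differs).
gain, best = 1.0, -np.inf
for episode in range(200):
    d = rng.uniform(-0.2, 0.2)              # DR step: resample the unCV each episode
    candidate = gain + rng.normal(0, 0.05)  # perturb the policy parameter
    r = performance_aware_reward(rollout(candidate, d))
    if r > best:
        gain, best = candidate, r
print(f"learned gain: {gain:.3f}, best performance-aware reward: {best:.3f}")

Because the unCV d is resampled every episode, the surviving policy parameter must perform well across the whole disturbance range rather than under a single fixed condition, which is the essence of learning a "universal" control policy via DR.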
Journal Introduction:
The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.