On-policy and Off-policy Q-learning algorithms with policy iteration for two-wheeled inverted pendulum systems

IF 5.2, CAS Zone 2 (Computer Science), JCR Q1: Automation & Control Systems

Robotics and Autonomous Systems, Pub Date: 2025-06-30, DOI: 10.1016/j.robot.2025.105111

Ba Quoc Anh Nguyen, Ngoc Trung Dang, Thanh Tung Le, Phuong Nam Dao

Volume 193, Article 105111. https://www.sciencedirect.com/science/article/pii/S0921889025002088

Citations: 0

Abstract

This article investigates on-policy and off-policy Q-learning algorithms for controlling two-wheeled inverted pendulum (TWIP) robots when the system dynamics are uncertain. Both algorithms achieve optimal, model-free control by learning from collected data, without knowledge of the model. The on-policy algorithm collects data in real time, continuously gathering new data and iteratively computing a new control policy until it converges to the optimal one. In contrast, the off-policy algorithm collects data only once and applies the learned policy to the system after the learning process is complete. To improve computational efficiency and reduce the amount of data required, the TWIP system is divided into two subsystems with smaller system matrices that can be controlled independently. This division reduces the data collection burden and speeds up the algorithms' computations. The off-policy technique proves advantageous for data efficiency and achieves higher accuracy. The influence of probing noise on the Q-function is considered in both proposed algorithms. By using a single data set and eliminating the influence of noise, the off-policy technique improves algorithm performance. Finally, simulation results for the TWIP system validate the effectiveness of the two proposed control schemes.
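The abstract gives no equations, so the following is only a minimal sketch of the general idea behind off-policy Q-learning with policy iteration, not the authors' algorithm. It assumes a discrete-time linear-quadratic setting, a hypothetical stable two-state toy system (not the TWIP model), and a zero initial gain; all names (`quad_features`, `off_policy_q_pi`, `Qc`, `Rc`) are illustrative. One data set is collected under pure probing noise and reused in every iteration. The next-step action in the Bellman equation is taken from the policy being evaluated rather than from the data, which is why the probing noise does not bias the estimate.

```python
import numpy as np

def quad_features(z):
    """phi(z) such that z' H z = phi(z) @ theta, theta = upper triangle of symmetric H."""
    i, j = np.triu_indices(len(z))
    return np.where(i == j, 1.0, 2.0) * z[i] * z[j]

def theta_to_H(theta, n):
    """Rebuild the symmetric matrix H from its upper-triangular entries."""
    H = np.zeros((n, n))
    H[np.triu_indices(n)] = theta
    return H + H.T - np.diag(np.diag(H))

def off_policy_q_pi(X, U, Xn, Qc, Rc, K0, iters=15):
    """Policy iteration on the Q-function using one fixed data set (x_k, u_k, x_{k+1})."""
    n, m = X.shape[1], U.shape[1]
    K = K0.copy()
    # Stage cost x'Qx + u'Ru for every sample.
    cost = np.einsum("ki,ij,kj->k", X, Qc, X) + np.einsum("ki,ij,kj->k", U, Rc, U)
    phi = np.array([quad_features(np.concatenate([x, u])) for x, u in zip(X, U)])
    for _ in range(iters):
        # Policy evaluation: the next action comes from the target policy u = -K x,
        # not from the behavior input stored in the data.
        phi_next = np.array([quad_features(np.concatenate([xn, -K @ xn])) for xn in Xn])
        theta, *_ = np.linalg.lstsq(phi - phi_next, cost, rcond=None)
        H = theta_to_H(theta, n + m)
        # Policy improvement: minimize z' H z over u, giving K = H_uu^{-1} H_ux.
        K = np.linalg.solve(H[n:, n:], H[n:, :n])
    return K

# Hypothetical toy system, used only to generate data; the learner never sees A or B.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Qc, Rc = np.eye(2), np.eye(1)

# Collect data once. The initial gain is zero, so the input is pure probing noise.
x = np.array([1.0, -1.0])
X, U, Xn = [], [], []
for _ in range(200):
    u = rng.normal(size=1)
    xn = A @ x + B @ u
    X.append(x); U.append(u); Xn.append(xn)
    x = xn
X, U, Xn = map(np.array, (X, U, Xn))

K_learned = off_policy_q_pi(X, U, Xn, Qc, Rc, K0=np.zeros((1, 2)))

# Model-based reference gain from Riccati iteration, for comparison only.
P = Qc.copy()
for _ in range(2000):
    G = np.linalg.solve(Rc + B.T @ P @ B, B.T @ P @ A)
    P = Qc + A.T @ P @ A - A.T @ P @ B @ G
K_star = np.linalg.solve(Rc + B.T @ P @ B, B.T @ P @ A)

print("learned gain:  ", K_learned)
print("reference gain:", K_star)
```

An on-policy variant would, under the same assumptions, re-collect data with the current gain after every improvement step instead of reusing one data set. The subsystem split described in the abstract would correspond to running such a learner separately on each smaller state vector.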

Source journal

Robotics and Autonomous Systems (Engineering & Technology: Robotics)

CiteScore: 9.00

Self-citation rate: 7.00%

Articles published: 164

Review time: 4.5 months

About the journal: Robotics and Autonomous Systems will carry articles describing fundamental developments in the field of robotics, with special emphasis on autonomous systems. An important goal of this journal is to extend the state of the art in both symbolic and sensory-based robot control and learning in the context of autonomous systems. Robotics and Autonomous Systems will carry articles on the theoretical, computational and experimental aspects of autonomous systems, or modules of such systems.