{"title":"离散时间马尔可夫跳跃系统的一般 TD-Q 学习控制方法。","authors":"Jiwei Wen, Huiwen Xue, Xiaoli Luan, Peng Shi","doi":"10.1016/j.isatra.2025.02.032","DOIUrl":null,"url":null,"abstract":"<p><p>This paper develops a novel temporal difference Q (TD-Q) learning approach, designed to address the robust control challenge in discrete-time Markov jump systems (MJSs) which are characterized by entirely unknown dynamics and transition probabilities (TPs). The model-free TD-Q learning method is uniquely comprehensive, including two special cases: Q learning for MJSs with unknown dynamics, and TD learning for MJSs with undetermined TPs. We propose an innovative ternary policy iteration framework, which iteratively refines the control policies through a dynamic loop of alternating updates. This loop consists of three synergistic processes: firstly, aligning TD value functions with current policies; secondly, enhancing Q-function's matrix kernels (QFMKs) using these TD value functions; and thirdly, generating greedy policies based on the enhanced QFMKs. We demonstrate that, with a sufficient number of episodes, the TD value functions, QFMKs, and control policies converge optimally within this iterative loop. To illustrate efficiency of the developed approach, we introduce a numerical example that highlights its substantial benefits through a thorough comparison with current learning control methods for MJSs. Moreover, a structured population dynamics model for pests is utilized to validate the practical applicability.</p>","PeriodicalId":94059,"journal":{"name":"ISA transactions","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A general TD-Q learning control approach for discrete-time Markov jump systems.\",\"authors\":\"Jiwei Wen, Huiwen Xue, Xiaoli Luan, Peng Shi\",\"doi\":\"10.1016/j.isatra.2025.02.032\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>This paper develops a novel temporal difference Q (TD-Q) learning approach, designed to address the robust control challenge in discrete-time Markov jump systems (MJSs) which are characterized by entirely unknown dynamics and transition probabilities (TPs). The model-free TD-Q learning method is uniquely comprehensive, including two special cases: Q learning for MJSs with unknown dynamics, and TD learning for MJSs with undetermined TPs. We propose an innovative ternary policy iteration framework, which iteratively refines the control policies through a dynamic loop of alternating updates. This loop consists of three synergistic processes: firstly, aligning TD value functions with current policies; secondly, enhancing Q-function's matrix kernels (QFMKs) using these TD value functions; and thirdly, generating greedy policies based on the enhanced QFMKs. We demonstrate that, with a sufficient number of episodes, the TD value functions, QFMKs, and control policies converge optimally within this iterative loop. To illustrate efficiency of the developed approach, we introduce a numerical example that highlights its substantial benefits through a thorough comparison with current learning control methods for MJSs. 
Moreover, a structured population dynamics model for pests is utilized to validate the practical applicability.</p>\",\"PeriodicalId\":94059,\"journal\":{\"name\":\"ISA transactions\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-03-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ISA transactions\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1016/j.isatra.2025.02.032\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ISA transactions","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.isatra.2025.02.032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A general TD-Q learning control approach for discrete-time Markov jump systems.
This paper develops a novel temporal difference Q (TD-Q) learning approach to address the robust control challenge for discrete-time Markov jump systems (MJSs) whose dynamics and transition probabilities (TPs) are entirely unknown. The model-free TD-Q learning method is general in that it covers two special cases: Q learning for MJSs with unknown dynamics, and TD learning for MJSs with undetermined TPs. We propose a ternary policy iteration framework that iteratively refines the control policies through a loop of alternating updates. This loop consists of three synergistic processes: first, aligning the TD value functions with the current policies; second, enhancing the Q-function matrix kernels (QFMKs) using these TD value functions; and third, generating greedy policies based on the enhanced QFMKs. We show that, given a sufficient number of episodes, the TD value functions, QFMKs, and control policies converge to their optima within this iterative loop. To illustrate the efficiency of the developed approach, we present a numerical example that highlights its substantial benefits through a thorough comparison with existing learning control methods for MJSs. Moreover, a structured population dynamics model for pests is used to validate the practical applicability of the approach.
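The abstract only outlines the ternary loop, so the following is a minimal Python sketch of that loop, not the authors' algorithm: it assumes a Markov jump linear system x_{k+1} = A_i x_k + B_i u_k with quadratic stage cost, mode-dependent linear policies u = -K_i x, and least-squares fits of per-mode value kernels P_j (TD evaluation) and Q-function matrix kernels H_i (QFMK update) from simulated data. All matrices, dimensions, function names, and hyper-parameters below are hypothetical.

```python
# Hypothetical setting: a 2-mode Markov jump linear system x_{k+1} = A_i x_k + B_i u_k
# with quadratic stage cost; the true matrices and TPs are hidden from the learner.
import numpy as np

rng = np.random.default_rng(0)

A = [np.array([[0.9, 0.3], [0.0, 0.8]]), np.array([[0.8, 0.1], [0.2, 0.7]])]
B = [np.array([[0.0], [1.0]]), np.array([[0.5], [1.0]])]
TP = np.array([[0.7, 0.3], [0.4, 0.6]])        # transition probabilities (unknown to the learner)
Qc, Rc = np.eye(2), np.eye(1)                  # stage-cost weights

def step(x, u, mode):
    """One simulator step: next state, next mode, stage cost."""
    x_next = A[mode] @ x + B[mode] @ u
    mode_next = rng.choice(2, p=TP[mode])
    return x_next, mode_next, float(x @ Qc @ x + u @ Rc @ u)

def rollout(K, n_steps=400, explore=0.3):
    """Collect (x, u, mode, cost, x_next, mode_next) tuples under the current gains K."""
    data, x, mode = [], rng.normal(size=2), 0
    for _ in range(n_steps):
        u = -K[mode] @ x + explore * rng.normal(size=1)   # exploration noise for excitation
        x_next, mode_next, c = step(x, u, mode)
        data.append((x, u, mode, c, x_next, mode_next))
        x, mode = x_next, mode_next
    return data

def fit_kernel(feats, targets):
    """Least-squares fit of a symmetric kernel M such that z' M z approximates the targets."""
    dim = feats[0].size
    Phi = np.stack([np.kron(z, z) for z in feats])
    vec_m, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    M = vec_m.reshape(dim, dim)
    return 0.5 * (M + M.T)

K = [np.zeros((1, 2)) for _ in range(2)]       # initial mode-dependent gains (assumed stabilizing)
P = [np.zeros((2, 2)) for _ in range(2)]       # TD value-function kernels, V_j(x) = x' P_j x

for it in range(10):                           # outer loop of the ternary policy iteration
    data = rollout(K)
    # (1) TD evaluation: fit P_j so that x' P_j x tracks cost + x_next' P_{j'} x_next.
    for _ in range(5):                         # a few fixed-point sweeps per iteration
        P = [fit_kernel([d[0] for d in data if d[2] == j],
                        [d[3] + d[4] @ P[d[5]] @ d[4] for d in data if d[2] == j])
             for j in range(2)]
    # (2) QFMK update: fit H_i over z = [x; u] against the same TD targets.
    H = [fit_kernel([np.concatenate([d[0], d[1]]) for d in data if d[2] == i],
                    [d[3] + d[4] @ P[d[5]] @ d[4] for d in data if d[2] == i])
         for i in range(2)]
    # (3) Greedy improvement: K_i = (H_i^{uu})^{-1} H_i^{ux} from the u-u and u-x blocks of H_i.
    K = [np.linalg.solve(H[i][2:, 2:], H[i][2:, :2]) for i in range(2)]
    print(f"iteration {it}: gains {[np.round(k, 3) for k in K]}")
```

The simulator matrices above exist only to generate data; the learner touches nothing but sampled transitions and costs, which is the sense in which the three-step loop (TD evaluation, QFMK fit, greedy improvement) remains model-free.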