Title: Reward Shaping Based on Optimal-Policy-Free
Authors: Jianghui Sang; Yongli Wang; Zaki Ahmad Khan; Xiaoliang Zhou
Journal: IEEE Transactions on Big Data, vol. 11, no. 4, pp. 1787-1798
Published: 2024-10-31
DOI: 10.1109/TBDATA.2024.3489415
URL: https://ieeexplore.ieee.org/document/10740180/
Abstract:
Existing research on potential-based reward shaping (PBRS) relies on the optimal policy of a Markov decision process (MDP), treating that policy as ground truth. In practical applications, however, extrapolation error opens a gap between the computed optimal policy and the real-world optimal policy, so the computed policy cannot be trusted. To remove this dependence, we design an optimal-policy-free reward shaping method. We view reinforcement learning as probabilistic inference on a directed graph: the inference propagates information outward from the rewarding states of the MDP and yields a function that serves as the potential function for PBRS. Our approach applies a contrastive learning technique to the directed graph Laplacian without altering the structure of the directed graph, and the Laplacian is then used to approximate the true state transition matrix of the MDP. The potential function for PBRS is learned through a message passing mechanism built on this directed graph Laplacian. Experiments on Atari, MuJoCo, and MiniWorld show that our approach outperforms competitive baselines.
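The abstract rests on two standard ingredients that a short sketch can make concrete: the PBRS shaping term F(s, s') = γΦ(s') − Φ(s), which provably preserves the optimal policy (Ng, Harada, and Russell, 1999), and the idea of obtaining Φ by propagating information from rewarding states through the MDP's transition structure. The Python below is a minimal illustration on a toy tabular MDP, not the paper's method: the transition matrix P stands in for the Laplacian-based approximation the authors learn via contrastive learning, and propagate_potential is an assumed fixed-point scheme standing in for their message passing mechanism.

```python
import numpy as np

def propagate_potential(P, r, gamma=0.99, n_iters=100):
    """Propagate information outward from rewarding states:
    fixed-point iteration phi <- r + gamma * P @ phi, so each state's
    potential reflects its discounted proximity to reward."""
    phi = np.zeros(P.shape[0])
    for _ in range(n_iters):
        phi = r + gamma * P @ phi
    return phi

def shaped_reward(r_env, phi, s, s_next, gamma=0.99):
    """Classic PBRS (Ng et al., 1999): F(s, s') = gamma*phi(s') - phi(s).
    Adding F to the environment reward leaves optimal policies unchanged."""
    return r_env + gamma * phi[s_next] - phi[s]

# Toy 4-state chain MDP: state 3 is rewarding; each state drifts right.
# In the paper's setting, P would instead be the directed-graph-Laplacian
# estimate of the state transition matrix.
P = np.array([
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.9],
    [0.0, 0.0, 0.0, 1.0],
])
r = np.array([0.0, 0.0, 0.0, 1.0])

phi = propagate_potential(P, r)
print(shaped_reward(0.0, phi, s=0, s_next=1))  # positive: dense signal toward the goal
```

Because the shaping term telescopes along any trajectory, the shaped MDP shares its optimal policies with the original one; the learned Φ only changes how quickly the agent discovers them.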
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles on big data, presenting innovative research ideas and application results across disciplines, including novel theories and algorithms. Coverage spans big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields that generate massive datasets.