{"title":"Investigation of Maximization Bias in Sarsa Variants","authors":"Ganesh Tata, Eric Austin","doi":"10.1109/SSCI50451.2021.9660081","DOIUrl":null,"url":null,"abstract":"The overestimation of action values caused by randomness in rewards can harm the ability to learn and the performance of reinforcement learning agents. This maximization bias has been well established and studied in the off-policy Q-learning algorithm. However, less study has been done for on-policy algorithms such as Sarsa and its variants. We conduct a thorough empirical analysis on Sarsa, Expected Sarsa, and n-step Sarsa. We find that the on-policy Sarsa variants suffer from less maximization bias than off-policy Q-learning in several test environments. We show how the choice of hyper-parameters impacts the severity of the bias. A decaying learning rate schedule results in more maximization bias than a fixed learning rate. Larger learning rates lead to larger overestimation. A larger exploration parameter leads to worse bias in Q-learning but less bias in the on-policy algorithms. We also show that a larger variance in rewards leads to more bias in both Q-Learning and Sarsa., but Sarsa is less affected than Q-learning.","PeriodicalId":255763,"journal":{"name":"2021 IEEE Symposium Series on Computational Intelligence (SSCI)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Symposium Series on Computational Intelligence (SSCI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SSCI50451.2021.9660081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
The overestimation of action values caused by randomness in rewards can harm the learning ability and performance of reinforcement learning agents. This maximization bias is well established and has been studied extensively for the off-policy Q-learning algorithm, but it has received far less attention for on-policy algorithms such as Sarsa and its variants. We conduct a thorough empirical analysis of Sarsa, Expected Sarsa, and n-step Sarsa, and find that these on-policy Sarsa variants suffer from less maximization bias than off-policy Q-learning in several test environments. We also show how the choice of hyper-parameters affects the severity of the bias: a decaying learning-rate schedule results in more maximization bias than a fixed learning rate, larger learning rates lead to larger overestimation, and a larger exploration parameter leads to worse bias in Q-learning but less bias in the on-policy algorithms. Finally, a larger variance in rewards leads to more bias in both Q-learning and Sarsa, but Sarsa is less affected than Q-learning.
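The mechanism behind this comparison can be illustrated with a small Python sketch (this is a toy illustration, not the paper's experimental setup). When every action has a true value of 0 and only the rewards are noisy, the max over sample-mean estimates used in the Q-learning target is biased upward, while the expectation under an ε-greedy policy, which is what the Expected Sarsa target uses (and what the Sarsa target equals in expectation), is biased less. The number of actions, samples per action, noise scale, and ε below are arbitrary illustrative choices.

```python
import numpy as np

# Toy illustration of maximization bias from noisy rewards.
# All true action values are 0; only the reward noise differs between runs.
rng = np.random.default_rng(0)
n_actions = 10      # actions, each with true value 0
n_samples = 5       # noisy reward samples per action
reward_std = 1.0    # standard deviation of the zero-mean reward noise
epsilon = 0.1       # exploration parameter for the on-policy target
n_trials = 10_000

max_targets, expected_targets = [], []
for _ in range(n_trials):
    # Sample-mean value estimate for each action (true values are all 0).
    q_hat = rng.normal(0.0, reward_std, size=(n_actions, n_samples)).mean(axis=1)

    # Q-learning-style target: max over noisy estimates -> biased upward.
    max_targets.append(q_hat.max())

    # Expected-Sarsa-style target: expectation under an epsilon-greedy policy
    # derived from the same estimates -> still biased, but less so.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[q_hat.argmax()] += 1.0 - epsilon
    expected_targets.append(probs @ q_hat)

print("true value of every action:          0.000")
print(f"mean max-based target (off-policy): {np.mean(max_targets):+.3f}")
print(f"mean expected target (on-policy):   {np.mean(expected_targets):+.3f}")
```

Running the sketch shows the max-based target sitting well above zero while the ε-greedy expectation is noticeably closer to the true value, mirroring the abstract's finding that on-policy targets reduce, but do not eliminate, maximization bias.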