Off-Policy Actor-critic for Recommender Systems
Minmin Chen, Can Xu, Vince Gatto, Devanshu Jain, Aviral Kumar, Ed H. Chi
{"title":"推荐系统的非政策行为者批评家","authors":"Minmin Chen, Can Xu, Vince Gatto, Devanshu Jain, Aviral Kumar, Ed H. Chi","doi":"10.1145/3523227.3546758","DOIUrl":null,"url":null,"abstract":"Industrial recommendation platforms are increasingly concerned with how to make recommendations that cause users to enjoy their long term experience on the platform. Reinforcement learning emerged naturally as an appealing approach for its promise in 1) combating feedback loop effect resulted from myopic system behaviors; and 2) sequential planning to optimize long term outcome. Scaling RL algorithms to production recommender systems serving billions of users and contents, however remain challenging. Sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables usage of off-policy data and batch learning. It on the other hand faces significant challenges in learning due to the distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised learning production system. Off-policy correction was employed to learn from logged data. The algorithm partially mitigates the distribution shift by employing a one-step importance weighting. We resort to the off-policy actor critic algorithms to addresses the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pairs under the target learned policy through temporal difference learning. We demonstrate in offline and live experiments that the new framework out-performs baseline and improves long term user experience. An interesting discovery along our investigation is that recommendation agents that employ a softmax policy parameterization, can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.","PeriodicalId":443279,"journal":{"name":"Proceedings of the 16th ACM Conference on Recommender Systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"Off-Policy Actor-critic for Recommender Systems\",\"authors\":\"Minmin Chen, Can Xu, Vince Gatto, Devanshu Jain, Aviral Kumar, Ed H. Chi\",\"doi\":\"10.1145/3523227.3546758\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Industrial recommendation platforms are increasingly concerned with how to make recommendations that cause users to enjoy their long term experience on the platform. Reinforcement learning emerged naturally as an appealing approach for its promise in 1) combating feedback loop effect resulted from myopic system behaviors; and 2) sequential planning to optimize long term outcome. Scaling RL algorithms to production recommender systems serving billions of users and contents, however remain challenging. Sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables usage of off-policy data and batch learning. It on the other hand faces significant challenges in learning due to the distribution shift. 
A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised learning production system. Off-policy correction was employed to learn from logged data. The algorithm partially mitigates the distribution shift by employing a one-step importance weighting. We resort to the off-policy actor critic algorithms to addresses the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pairs under the target learned policy through temporal difference learning. We demonstrate in offline and live experiments that the new framework out-performs baseline and improves long term user experience. An interesting discovery along our investigation is that recommendation agents that employ a softmax policy parameterization, can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.\",\"PeriodicalId\":443279,\"journal\":{\"name\":\"Proceedings of the 16th ACM Conference on Recommender Systems\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 16th ACM Conference on Recommender Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3523227.3546758\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th ACM Conference on Recommender Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3523227.3546758","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Industrial recommendation platforms are increasingly concerned with how to make recommendations that lead users to enjoy their long-term experience on the platform. Reinforcement learning emerged naturally as an appealing approach for its promise in 1) combating the feedback loop effect resulting from myopic system behaviors; and 2) sequential planning to optimize long-term outcomes. Scaling RL algorithms to production recommender systems serving billions of users and pieces of content, however, remains challenging. The sample inefficiency and instability of online RL hinder its widespread adoption in production. Offline RL enables the use of off-policy data and batch learning; on the other hand, it faces significant learning challenges due to distribution shift. A REINFORCE agent [3] was successfully tested for YouTube recommendation, significantly outperforming a sophisticated supervised learning production system. Off-policy correction was employed to learn from logged data; the algorithm partially mitigates the distribution shift by employing a one-step importance weighting. We resort to off-policy actor-critic algorithms to address the distribution shift to a better extent. Here we share the key designs in setting up an off-policy actor-critic agent for production recommender systems. It extends [3] with a critic network that estimates the value of any state-action pair under the target learned policy through temporal difference learning. We demonstrate in offline and live experiments that the new framework outperforms the baseline and improves long-term user experience. An interesting discovery along our investigation is that recommendation agents employing a softmax policy parameterization can end up being too pessimistic about out-of-distribution (OOD) actions. Finding the right balance between pessimism and optimism on OOD actions is critical to the success of offline RL for recommender systems.
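
The sketch below is a minimal, illustrative rendering of the setup the abstract describes, not the authors' production implementation: a softmax policy over a discrete item corpus, a critic trained by one-step temporal-difference learning to estimate state-action values under the target policy, and a REINFORCE-style actor update corrected with a one-step importance weight against the logging (behavior) policy. The corpus size, network shapes, learning rates, the weight cap, and names such as Actor, Critic, and behavior_prob are assumptions introduced here for illustration.

# Hedged sketch of an off-policy actor-critic update from logged data.
# All sizes, names, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ITEMS = 1000   # toy recommendation corpus size (assumption)
STATE_DIM = 64     # user-state embedding dimensionality (assumption)
GAMMA = 0.97       # discount factor for long-term reward (assumption)

class Actor(nn.Module):
    """Softmax policy over items: returns log pi(.|s)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, NUM_ITEMS))
    def forward(self, state):
        return F.log_softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    """Critic head: one Q(s, a) estimate per item for a given state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, NUM_ITEMS))
    def forward(self, state):
        return self.net(state)

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward, next_state, behavior_prob):
    """One update from a logged batch of (s, a, r, s', beta(a|s)) tuples."""
    # Critic: one-step TD learning of Q under the target policy, using the
    # expectation of Q over pi at the next state as the bootstrap value.
    q_all = critic(state)                                    # [B, NUM_ITEMS]
    q_sa = q_all.gather(1, action.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():
        next_pi = actor(next_state).exp()                    # pi(.|s')
        next_v = (next_pi * critic(next_state)).sum(dim=1)   # E_pi[Q(s', .)]
        td_target = reward + GAMMA * next_v
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: REINFORCE-style gradient with a one-step importance weight
    # pi(a|s)/beta(a|s) to correct for the logging policy, using the
    # critic's Q(s, a) in place of the empirical long-term return.
    log_pi = actor(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        iw = (log_pi.exp() / behavior_prob).clamp(max=10.0)  # capped weight
        advantage = q_sa.detach()
    actor_loss = -(iw * advantage * log_pi).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

In this sketch the importance weight is capped to tame variance; how strongly the critic's value estimates for out-of-distribution actions are regularized (the pessimism-optimism balance discussed above) is left open, as the abstract identifies that balance as the key design question rather than prescribing a fixed recipe.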