End-to-End Deep Reinforcement Learning based Recommendation with Supervised Embedding
Feng Liu, Huifeng Guo, Xutao Li, Ruiming Tang, Yunming Ye, Xiuqiang He
Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM 2020), published 2020-01-20
DOI: 10.1145/3336191.3371858
Citations: 35
Abstract
Reinforcement learning (RL) based recommendation has become a hot topic in the recommendation community, owing to recent advances in interactive recommender systems. Existing RL recommendation approaches can be summarized into a unified framework with three components, namely the embedding component (EC), the state representation component (SRC), and the policy component (PC). We find that the EC cannot be trained effectively together with the other two components. Previous studies bypass this obstacle through a pre-training and fixing strategy, which prevents their approaches from being truly end-to-end. More importantly, such a pre-trained and fixed EC suffers from two inherent drawbacks: (1) pre-trained and fixed embeddings cannot model users' evolving preferences and item correlations in a dynamic environment; (2) pre-training is inconvenient in industrial applications. To address this problem, we propose an End-to-end Deep Reinforcement learning based Recommendation framework (EDRR). In this framework, a supervised learning signal is carefully designed to smooth the update gradients to the EC, and three ways of incorporating it are introduced and compared. To the best of our knowledge, we are the first to address the training compatibility among the three components in RL based recommendation. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that the proposed EDRR effectively achieves end-to-end training for both policy-based and value-based RL models and delivers better performance than state-of-the-art methods.
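To make the three-component framework concrete, below is a minimal PyTorch sketch, not the authors' implementation: the EC maps item IDs to embeddings, the SRC encodes the interaction history into a state, and the PC outputs action logits over items. The class name EDRRSketch, the sup_head layer, and the toy next-item reward are illustrative assumptions; the sketch only shows the general idea from the abstract, i.e., adding a supervised next-item loss alongside the RL objective so that the gradients reaching the EC are not driven by the RL signal alone.

```python
# Minimal sketch (assumed setup, not the paper's code) of the EC/SRC/PC
# framework with a supervised auxiliary loss stabilizing the embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ITEMS, EMB_DIM, HID_DIM = 1000, 32, 64  # toy sizes, chosen arbitrarily

class EDRRSketch(nn.Module):  # hypothetical name for illustration
    def __init__(self):
        super().__init__()
        self.ec = nn.Embedding(NUM_ITEMS, EMB_DIM)              # embedding component (EC)
        self.src = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)   # state representation component (SRC)
        self.pc = nn.Linear(HID_DIM, NUM_ITEMS)                 # policy component (PC): action logits
        self.sup_head = nn.Linear(HID_DIM, NUM_ITEMS)           # assumed supervised head (next-item prediction)

    def forward(self, history):
        emb = self.ec(history)            # (B, T, EMB_DIM)
        _, state = self.src(emb)          # final hidden state: (1, B, HID_DIM)
        state = state.squeeze(0)          # (B, HID_DIM)
        return self.pc(state), self.sup_head(state)

model = EDRRSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy joint update: a REINFORCE-style policy loss plus a supervised
# next-item loss; both backpropagate into the EC, so the supervised
# signal smooths the gradients the embeddings receive.
history = torch.randint(0, NUM_ITEMS, (8, 5))    # batch of interaction histories
next_item = torch.randint(0, NUM_ITEMS, (8,))    # ground-truth next clicks (toy labels)
logits, sup_logits = model(history)

dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
reward = (action == next_item).float()           # toy reward: 1 if the recommended item is clicked
rl_loss = -(dist.log_prob(action) * reward).mean()
sup_loss = F.cross_entropy(sup_logits, next_item)  # supervised signal reaching the EC

(rl_loss + sup_loss).backward()
opt.step()
```

The paper compares three ways of incorporating the supervised signal; the simple summed loss above is just one plausible instance, and value-based variants would replace the REINFORCE term with a temporal-difference loss.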