Research and Application of Reinforcement Learning Recommendation Method for Taobao
Authors: Lan Huang, Xiaofang Zhang, Yan Wang, Xuping Xie
Published in: 2021 IEEE Symposium on Computers and Communications (ISCC), 2021-09-05
DOI: 10.1109/ISCC53001.2021.9631429 (https://doi.org/10.1109/ISCC53001.2021.9631429)
Citation count: 0
Abstract
Many e-commerce companies now use reinforcement learning recommendation methods to maximize long-term benefits. Alibaba Group and Nanjing University built “Virtual Taobao”, a Taobao simulator. In this paper, we propose TTD3, which is based on TD3, and train it in Virtual Taobao. TTD3's training process contains three important improvements. First, the current actor network and the target actor network each predict a candidate action for Virtual Taobao's current state, and the candidate that the current critic network assigns the larger value is selected as the action to execute. Second, an Ornstein-Uhlenbeck (OU) process is used as the exploration noise to improve the agent's ability to explore Virtual Taobao. Third, prioritized experience replay is adopted to improve sampling efficiency. TTD3 achieves the highest average click-through rate (CTR) in Virtual Taobao, about 0.85, which is superior to TD3 as well as to the DPPO, SAC, and DDPG methods used by Virtual Taobao's authors.
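The three improvements named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class and function names (`OUNoise`, `select_action`, `PrioritizedReplay`), the hyperparameters (`theta`, `sigma`, `alpha`, `beta`, `eps`), and the proportional form of prioritization are all assumptions chosen for illustration. The abstract also does not say at which point the OU noise is added, so adding it to the selected action is likewise an assumption.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck noise with mean 0 (Euler-Maruyama discretization)."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        # theta, sigma and dt are illustrative defaults, not values from the paper.
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = random.Random(seed)
        self.state = [0.0] * dim

    def sample(self):
        # x <- x - theta * x * dt + sigma * sqrt(dt) * N(0, 1): temporally correlated noise.
        scale = self.sigma * self.dt ** 0.5
        self.state = [x - self.theta * x * self.dt + scale * self.rng.gauss(0.0, 1.0)
                      for x in self.state]
        return list(self.state)


def select_action(state, actor, target_actor, critic):
    """Return whichever candidate action the current critic values more."""
    candidates = [actor(state), target_actor(state)]
    # max() keeps the first candidate (the current actor's) on a tie.
    return max(candidates, key=lambda action: critic(state, action))


class PrioritizedReplay:
    """Proportional prioritized replay; list-based O(n) sampling for clarity."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6, seed=0):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []
        self.rng = random.Random(seed)

    def add(self, transition):
        # New transitions start at the current maximum priority.
        priority = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idx = self.rng.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idx]
        max_w = max(weights)
        return idx, [self.data[i] for i in idx], [w / max_w for w in weights]

    def update(self, idx, td_errors):
        # Priority grows with the magnitude of the temporal-difference error.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps


# Toy usage with hypothetical stand-ins for the networks.
actor = lambda s: [1.0]
target_actor = lambda s: [0.0]
critic = lambda s, a: -sum((x - 0.2) ** 2 for x in a)

noise = OUNoise(dim=1)
chosen = select_action([0.0, 0.0], actor, target_actor, critic)
noisy_action = [a + n for a, n in zip(chosen, noise.sample())]
```

In this toy setup the critic prefers actions near 0.2, so the target actor's candidate is chosen over the current actor's. A practical replay buffer would typically use a sum-tree for O(log n) sampling instead of the lists used here.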