An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

Peng Liu, Jiawei Zhu, Cong Xu, Ming Zhao, Bin Wang

arXiv - CS - Information Retrieval, 2024-09-18 · https://doi.org/arxiv-2409.11678
Abstract
As the last key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) is responsible for combining the multiple scores predicted by Multi-Task Learning (MTL) into a final score that maximizes user satisfaction, and it determines the ultimate recommendation results. In recent years, Reinforcement Learning (RL) has been widely used for MTF in large-scale RSs to maximize long-term user satisfaction within a recommendation session. However, limited by their modeling pattern, all current RL-MTF methods can only use user features as the state when generating an action for each user; they are unable to exploit item features and other valuable features, which leads to suboptimal results. Addressing this problem requires breaking through the current modeling pattern of RL-MTF. To this end, we propose a novel method called Enhanced-State RL for MTF in RSs. Unlike the existing methods mentioned above, our method first defines user features, item features, and other valuable features collectively as the enhanced state, and then proposes a novel actor and critic learning process that utilizes the enhanced state to generate a much better action for each user-item pair. To the best of our knowledge, this modeling pattern is proposed for the first time in the field of RL-MTF. We conduct extensive offline and online experiments in a large-scale RS, and the results demonstrate that our model significantly outperforms other models. Enhanced-State RL has been fully deployed in our RS for more than half a year, improving user valid consumption by +3.84% and user duration time by +0.58% compared to the baseline.
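
To make the enhanced-state idea concrete, the following is a minimal sketch (not the authors' implementation) of an actor that conditions jointly on user features, item features, and other context features, and emits fusion weights per user-item pair; the weights are then used to combine the MTL task scores into one ranking score. The feature dimensions, network sizes, task names, and the weighted-sum fusion form are all illustrative assumptions.

```python
# Sketch of an enhanced-state actor for MTF: the state is the concatenation of
# user, item, and context features, so the action (fusion weights) can differ
# for every user-item pair. All sizes and the fusion form are assumptions.
import torch
import torch.nn as nn

NUM_TASKS = 3                               # e.g. click, valid consumption, duration (assumed)
USER_DIM, ITEM_DIM, CTX_DIM = 64, 64, 16    # assumed feature dimensions


class EnhancedStateActor(nn.Module):
    """Maps the enhanced state (user + item + other features) to fusion weights."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(USER_DIM + ITEM_DIM + CTX_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, NUM_TASKS),
            nn.Softplus(),                  # keep fusion weights positive
        )

    def forward(self, user_feat, item_feat, ctx_feat):
        # Unlike user-state-only RL-MTF, item and context features enter the state.
        state = torch.cat([user_feat, item_feat, ctx_feat], dim=-1)
        return self.net(state)


def fuse_scores(mtl_scores, weights):
    """Combine per-task MTL scores into a final ranking score (weighted sum assumed)."""
    return (weights * mtl_scores).sum(dim=-1)


# Usage: score a batch of candidate items for one user.
batch = 8
actor = EnhancedStateActor()
user = torch.randn(batch, USER_DIM)         # user features, repeated per candidate
items = torch.randn(batch, ITEM_DIM)        # one row per candidate item
ctx = torch.randn(batch, CTX_DIM)           # other valuable features (e.g. scenario)
mtl_scores = torch.rand(batch, NUM_TASKS)   # scores from the upstream MTL model

weights = actor(user, items, ctx)           # one action per user-item pair
final_scores = fuse_scores(mtl_scores, weights)
print(final_scores.shape)                   # torch.Size([8])
```

In a user-state-only formulation, the same fusion weights would apply to every candidate item for a given user; conditioning the actor on the full enhanced state is what allows the weights above to vary across user-item pairs.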