Reinforcement Learning to Rank with Pairwise Policy Gradient

Jun Xu, Zeng Wei, Long Xia, Yanyan Lan, Dawei Yin, Xueqi Cheng, Ji-Rong Wen
{"title":"Reinforcement Learning to Rank with Pairwise Policy Gradient","authors":"Jun Xu, Zeng Wei, Long Xia, Yanyan Lan, Dawei Yin, Xueqi Cheng, Ji-rong Wen","doi":"10.1145/3397271.3401148","DOIUrl":null,"url":null,"abstract":"This paper concerns reinforcement learning~(RL) of the document ranking models for information retrieval~(IR). One branch of the RL approaches to ranking formalize the process of ranking with Markov decision process~(MDP) and determine the model parameters with policy gradient. Though preliminary success has been shown, these approaches are still far from achieving their full potentials. Existing policy gradient methods directly utilize the absolute performance scores (returns) of the sampled document lists in its gradient estimations, which may cause two limitations: 1) fail to reflect the relative goodness of documents within the same query, which usually is close to the nature of IR ranking; 2) generate high variance gradient estimations, resulting in slow learning speed and low ranking accuracy. To deal with the issues, we propose a novel policy gradient algorithm in which the gradients are determined using pairwise comparisons of two document lists sampled within the same query. The algorithm, referred to as Pairwise Policy Gradient (PPG), repeatedly samples pairs of document lists, estimates the gradients with pairwise comparisons, and finally updates the model parameters. Theoretical analysis shows that PPG makes an unbiased and low variance gradient estimations. Experimental results have demonstrated performance gains over the state-of-the-art baselines in search result diversification and text retrieval.","PeriodicalId":252050,"journal":{"name":"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3397271.3401148","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

This paper concerns reinforcement learning (RL) of document ranking models for information retrieval (IR). One branch of RL approaches to ranking formalizes the ranking process as a Markov decision process (MDP) and determines the model parameters with policy gradient. Though preliminary success has been shown, these approaches are still far from achieving their full potential. Existing policy gradient methods directly use the absolute performance scores (returns) of the sampled document lists in their gradient estimations, which causes two limitations: 1) they fail to reflect the relative goodness of documents within the same query, which is closer to the nature of IR ranking; 2) they generate high-variance gradient estimates, resulting in slow learning and low ranking accuracy. To address these issues, we propose a novel policy gradient algorithm in which the gradients are determined using pairwise comparisons of two document lists sampled within the same query. The algorithm, referred to as Pairwise Policy Gradient (PPG), repeatedly samples pairs of document lists, estimates the gradients with pairwise comparisons, and finally updates the model parameters. Theoretical analysis shows that PPG produces unbiased, low-variance gradient estimates. Experimental results demonstrate performance gains over state-of-the-art baselines in search result diversification and text retrieval.
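To make the sample-compare-update loop described above concrete, here is a minimal sketch in PyTorch. It assumes a Plackett-Luce ranking policy over per-document scores, DCG as the scalar return, and a REINFORCE-style estimator reweighted by the return difference between the two sampled lists; these choices (including the toy relevance labels and the free score vector standing in for a scoring model) are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal, self-contained sketch of the pairwise policy gradient idea,
# assuming a Plackett-Luce ranking policy and DCG as the return. The
# policy, reward, and exact weighting are illustrative, not the paper's
# exact estimator.
import math
import torch

def sample_ranking(scores):
    """Sample a document ordering from a Plackett-Luce policy defined by
    `scores`; return the ordering and its (differentiable) log-probability."""
    log_prob = torch.tensor(0.0)
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        probs = torch.softmax(scores[remaining], dim=0)
        i = torch.multinomial(probs, 1).item()
        log_prob = log_prob + torch.log(probs[i])
        order.append(remaining.pop(i))
    return order, log_prob

def dcg(order, relevance, k=10):
    """Discounted cumulative gain of a ranked list (the scalar return)."""
    return sum((2 ** relevance[d] - 1) / math.log2(pos + 2)
               for pos, d in enumerate(order[:k]))

# Toy query: five documents with graded relevance labels. In practice the
# scores would come from a neural ranking model, not a free parameter vector.
relevance = [2, 0, 1, 0, 3]
scores = torch.zeros(5, requires_grad=True)
opt = torch.optim.SGD([scores], lr=0.1)

for step in range(200):
    # Sample a PAIR of document lists for the same query.
    order_a, logp_a = sample_ranking(scores)
    order_b, logp_b = sample_ranking(scores)
    r_a, r_b = dcg(order_a, relevance), dcg(order_b, relevance)
    # Pairwise comparison: the gradient is weighted by the *difference*
    # between the two returns rather than by an absolute return, so any
    # constant offset in the reward cancels out.
    loss = -(r_a - r_b) * (logp_a - logp_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The contrast with vanilla REINFORCE is in the loss line: an absolute return `r_a` is replaced by the relative quantity `r_a - r_b`, so only the comparison between two lists for the same query drives the update, which is the intuition behind the unbiasedness and variance-reduction claims in the abstract.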