Reinforcement learning algorithm for reusable resource allocation with unknown rental time distribution

Impact Factor: 6.0 | CAS Tier 2 (Management Science) | JCR Q1, Operations Research & Management Science
Ziwei Wang, Jie Song, Yixuan Liu, Jingtong Zhao
{"title":"Reinforcement learning algorithm for reusable resource allocation with unknown rental time distribution","authors":"Ziwei Wang, Jie Song, Yixuan Liu, Jingtong Zhao","doi":"10.1016/j.ejor.2025.09.012","DOIUrl":null,"url":null,"abstract":"We explore a scenario where a platform must decide on the price and type of reusable resources for sequentially arriving customers. The product is rented for a random period, during which the platform also extracts rewards based on a prearranged agreement. The expected reward varies during the usage time, and the platform aims to maximize revenue over a finite horizon. Two primary challenges are encountered: the stochastic usage time introduces uncertainty, affecting product availability, and the platform lacks initial knowledge about reward and usage time distributions. In contrast to conventional online learning, where usage time distributions are parametric, our problem allows for unknown distribution types. To overcome these challenges, we formulate the problem as a Markov decision process and model the usage time distribution using a hazard rate. We first introduce a greedy policy in the full-information setting with a provable 1/2-approximation ratio. We then develop a reinforcement learning algorithm to implement this policy when the parameters are unknown, allowing for non-parametric distributions and time-varying rewards. We further prove that the algorithm achieves sublinear regret against the greedy policy. Numerical experiments on synthetic data as well as a real dataset from TikTok demonstrate the effectiveness of our method.","PeriodicalId":55161,"journal":{"name":"European Journal of Operational Research","volume":"326 1","pages":""},"PeriodicalIF":6.0000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Operational Research","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1016/j.ejor.2025.09.012","RegionNum":2,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPERATIONS RESEARCH & MANAGEMENT SCIENCE","Score":null,"Total":0}
Citations: 0

Abstract

We explore a scenario where a platform must decide on the price and type of reusable resources for sequentially arriving customers. The product is rented for a random period, during which the platform also extracts rewards based on a prearranged agreement. The expected reward varies during the usage time, and the platform aims to maximize revenue over a finite horizon. Two primary challenges are encountered: the stochastic usage time introduces uncertainty, affecting product availability, and the platform lacks initial knowledge about reward and usage time distributions. In contrast to conventional online learning, where usage time distributions are parametric, our problem allows for unknown distribution types. To overcome these challenges, we formulate the problem as a Markov decision process and model the usage time distribution using a hazard rate. We first introduce a greedy policy in the full-information setting with a provable 1/2-approximation ratio. We then develop a reinforcement learning algorithm to implement this policy when the parameters are unknown, allowing for non-parametric distributions and time-varying rewards. We further prove that the algorithm achieves sublinear regret against the greedy policy. Numerical experiments on synthetic data as well as a real dataset from TikTok demonstrate the effectiveness of our method.
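The paper does not include code, but the hazard-rate device described in the abstract is easy to illustrate. In discrete time, the hazard h(t) = P(rental ends in period t | still active entering period t) fully determines the usage-time distribution through the survival function S(t) = ∏_{s ≤ t} (1 − h(s)), so estimating h period by period requires no parametric family. The sketch below is illustrative only; the function names, the plug-in estimator, and the geometric toy data are assumptions for demonstration, not the authors' algorithm.

```python
import numpy as np

def estimate_hazard(durations: np.ndarray, horizon: int) -> np.ndarray:
    """Empirical discrete-time hazard: h[t] = (# rentals returned in period t+1)
    / (# rentals still active entering period t+1).
    Nonparametric: no distributional family is assumed for the rental time."""
    hazard = np.zeros(horizon)
    for t in range(horizon):
        at_risk = np.sum(durations >= t + 1)   # rentals lasting at least t+1 periods
        returned = np.sum(durations == t + 1)  # rentals ending exactly in period t+1
        hazard[t] = returned / at_risk if at_risk > 0 else 1.0
    return hazard

def survival_and_pmf(hazard: np.ndarray):
    """Recover P(T > t) and P(T = t) from the hazard via S(t) = prod_{s<=t}(1 - h(s))."""
    survival = np.cumprod(1.0 - hazard)
    pmf = hazard * np.concatenate(([1.0], survival[:-1]))
    return survival, pmf

# Toy usage (assumed data): geometric rentals have constant hazard p,
# which the estimator should recover without being told the family.
rng = np.random.default_rng(0)
durations = rng.geometric(p=0.3, size=5000)
h = estimate_hazard(durations, horizon=20)
S, pmf = survival_and_pmf(h)
print("estimated hazard, first 5 periods:", np.round(h[:5], 3))  # all close to 0.3
```

In the learning setting of the paper, an estimate of this kind would be refreshed as rentals complete, which is what allows unknown, non-parametric usage-time distributions and time-varying rewards to be handled; the greedy allocation policy and the regret analysis are beyond this sketch.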
Source journal

European Journal of Operational Research
Category: Management Science - Operations Research & Management Science
CiteScore: 11.90
Self-citation rate: 9.40%
Annual articles: 786
Review time: 8.2 months
About the journal: The European Journal of Operational Research (EJOR) publishes high-quality, original papers that contribute to the methodology of operational research (OR) and to the practice of decision making.