Reinforcement learning algorithm for reusable resource allocation with unknown rental time distribution

Impact Factor: 6.0 | CAS Tier 2 (Management Science) | JCR Q1, Operations Research & Management Science
Ziwei Wang, Jie Song, Yixuan Liu, Jingtong Zhao
{"title":"Reinforcement learning algorithm for reusable resource allocation with unknown rental time distribution","authors":"Ziwei Wang, Jie Song, Yixuan Liu, Jingtong Zhao","doi":"10.1016/j.ejor.2025.09.012","DOIUrl":null,"url":null,"abstract":"We explore a scenario where a platform must decide on the price and type of reusable resources for sequentially arriving customers. The product is rented for a random period, during which the platform also extracts rewards based on a prearranged agreement. The expected reward varies during the usage time, and the platform aims to maximize revenue over a finite horizon. Two primary challenges are encountered: the stochastic usage time introduces uncertainty, affecting product availability, and the platform lacks initial knowledge about reward and usage time distributions. In contrast to conventional online learning, where usage time distributions are parametric, our problem allows for unknown distribution types. To overcome these challenges, we formulate the problem as a Markov decision process and model the usage time distribution using a hazard rate. We first introduce a greedy policy in the full-information setting with a provable 1/2-approximation ratio. We then develop a reinforcement learning algorithm to implement this policy when the parameters are unknown, allowing for non-parametric distributions and time-varying rewards. We further prove that the algorithm achieves sublinear regret against the greedy policy. Numerical experiments on synthetic data as well as a real dataset from TikTok demonstrate the effectiveness of our method.","PeriodicalId":55161,"journal":{"name":"European Journal of Operational Research","volume":"326 1","pages":""},"PeriodicalIF":6.0000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Operational Research","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1016/j.ejor.2025.09.012","RegionNum":2,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPERATIONS RESEARCH & MANAGEMENT SCIENCE","Score":null,"Total":0}
Citations: 0

Abstract

We explore a scenario where a platform must decide on the price and type of reusable resources for sequentially arriving customers. The product is rented for a random period, during which the platform also extracts rewards based on a prearranged agreement. The expected reward varies during the usage time, and the platform aims to maximize revenue over a finite horizon. Two primary challenges are encountered: the stochastic usage time introduces uncertainty, affecting product availability, and the platform lacks initial knowledge about reward and usage time distributions. In contrast to conventional online learning, where usage time distributions are parametric, our problem allows for unknown distribution types. To overcome these challenges, we formulate the problem as a Markov decision process and model the usage time distribution using a hazard rate. We first introduce a greedy policy in the full-information setting with a provable 1/2-approximation ratio. We then develop a reinforcement learning algorithm to implement this policy when the parameters are unknown, allowing for non-parametric distributions and time-varying rewards. We further prove that the algorithm achieves sublinear regret against the greedy policy. Numerical experiments on synthetic data as well as a real dataset from TikTok demonstrate the effectiveness of our method.
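The paper does not include code, but the hazard-rate device described in the abstract is easy to illustrate. In discrete time, the hazard h(t) = P(rental ends in period t | still active entering period t) fully determines the usage-time distribution through the survival function S(t) = ∏_{s ≤ t} (1 − h(s)), so estimating h period by period requires no parametric family. The sketch below is illustrative only; the function names, the plug-in estimator, and the geometric toy data are assumptions for demonstration, not the authors' algorithm.

```python
import numpy as np

def estimate_hazard(durations: np.ndarray, horizon: int) -> np.ndarray:
    """Empirical discrete-time hazard: h[t] = (# rentals returned in period t+1)
    / (# rentals still active entering period t+1).
    Nonparametric: no distributional family is assumed for the rental time."""
    hazard = np.zeros(horizon)
    for t in range(horizon):
        at_risk = np.sum(durations >= t + 1)   # rentals lasting at least t+1 periods
        returned = np.sum(durations == t + 1)  # rentals ending exactly in period t+1
        hazard[t] = returned / at_risk if at_risk > 0 else 1.0
    return hazard

def survival_and_pmf(hazard: np.ndarray):
    """Recover P(T > t) and P(T = t) from the hazard via S(t) = prod_{s<=t}(1 - h(s))."""
    survival = np.cumprod(1.0 - hazard)
    pmf = hazard * np.concatenate(([1.0], survival[:-1]))
    return survival, pmf

# Toy usage (assumed data): geometric rentals have constant hazard p,
# which the estimator should recover without being told the family.
rng = np.random.default_rng(0)
durations = rng.geometric(p=0.3, size=5000)
h = estimate_hazard(durations, horizon=20)
S, pmf = survival_and_pmf(h)
print("estimated hazard, first 5 periods:", np.round(h[:5], 3))  # all close to 0.3
```

In the learning setting of the paper, an estimate of this kind would be refreshed as rentals complete, which is what allows unknown, non-parametric usage-time distributions and time-varying rewards to be handled; the greedy allocation policy and the regret analysis are beyond this sketch.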
Source journal

European Journal of Operational Research
Category: Management Science - Operations Research & Management Science
CiteScore: 11.90
Self-citation rate: 9.40%
Annual articles: 786
Review time: 8.2 months
About the journal: The European Journal of Operational Research (EJOR) publishes high-quality, original papers that contribute to the methodology of operational research (OR) and to the practice of decision making.