Coordinating ride-pooling with public transit using Reward-Guided Conservative Q-Learning: An offline training and online fine-tuning reinforcement learning framework

IF 7.6 · CAS Tier 1 (Engineering & Technology) · JCR Q1 (Transportation Science & Technology)
Yulong Hu , Tingting Dong , Sen Li
DOI: 10.1016/j.trc.2025.105051
Journal: Transportation Research Part C: Emerging Technologies, Volume 174, Article 105051
Published: 2025-03-13
Full text: https://www.sciencedirect.com/science/article/pii/S0968090X25000555
Citations: 0

Abstract

This paper introduces a novel reinforcement learning (RL) framework, termed Reward-Guided Conservative Q-learning (RG-CQL), to enhance coordination between ride-pooling and public transit within a multimodal transportation network. We model each ride-pooling vehicle as an agent governed by a Markov Decision Process (MDP), which includes a state for each agent encompassing the vehicle’s location, the number of vacant seats, and all pertinent information regarding the passengers on board. We propose an offline training and online fine-tuning RL framework to learn the optimal operational decisions of the multimodal transportation systems, including rider-vehicle matching, selection of drop-off locations for passengers, and vehicle routing decisions, with improved data efficiency. During the offline training phase, we develop a Conservative Double Deep Q Network (CDDQN) as the action executor and a supervised learning-based reward estimator, termed the Guider Network, to extract valuable insights into action-reward relationships from data batches. In the online fine-tuning phase, the Guider Network serves as an exploration guide, aiding CDDQN in effectively and conservatively exploring unknown state–action pairs to bridge the gap between the conservative offline training and optimistic online fine-tuning. The efficacy of our algorithm is demonstrated through a realistic case study using real-world data from Manhattan. We show that integrating ride-pooling with public transit outperforms two benchmark cases—solo rides coordinated with transit and ride-pooling without transit coordination—by 17% and 22% in the achieved system rewards, respectively. Furthermore, our innovative offline training and online fine-tuning framework offers a remarkable 81.3% improvement in data efficiency compared to traditional online RL methods with adequate exploration budgets, with a 4.3% increase in total rewards and a 5.6% reduction in overestimation errors. Experimental results further demonstrate that RG-CQL effectively addresses the challenges of transitioning from offline to online RL in large-scale ride-pooling systems integrated with transit.
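The abstract describes CDDQN as a Double DQN trained conservatively on logged data; it does not give RG-CQL's exact loss, so the sketch below shows a generic tabular version of such a conservative Double-Q update in the style of CQL (log-sum-exp penalty over all actions, offset by the action observed in the batch). All sizes, transitions, rewards, and hyperparameters are illustrative stand-ins, not values from the paper.

```python
import numpy as np

n_states, n_actions = 5, 3
gamma, lr, alpha = 0.99, 0.1, 1.0   # discount, step size, conservatism weight

q_online = np.zeros((n_states, n_actions))
q_target = np.zeros((n_states, n_actions))

# Tiny synthetic batch of logged (state, action, reward, next_state)
# transitions, standing in for recorded ride-pooling decisions.
batch = [(0, 1, 1.0, 2), (2, 0, 0.5, 3), (3, 2, 2.0, 4)]

def conservative_double_q_update(batch):
    for s, a, r, s_next in batch:
        # Double DQN target: the online table selects the next action,
        # the target table evaluates it (reduces overestimation).
        a_star = int(np.argmax(q_online[s_next]))
        td_error = r + gamma * q_target[s_next, a_star] - q_online[s, a]
        # CQL-style penalty: gradient of (logsumexp(Q(s,.)) - Q(s,a)),
        # which pushes down out-of-data actions and up the logged one.
        softmax = np.exp(q_online[s] - q_online[s].max())
        softmax /= softmax.sum()
        penalty_grad = softmax.copy()
        penalty_grad[a] -= 1.0
        q_online[s] -= lr * alpha * penalty_grad
        q_online[s, a] += lr * td_error

for step in range(200):
    conservative_double_q_update(batch)
    if step % 20 == 0:
        q_target[:] = q_online   # periodic target-table sync

# After training, the logged action dominates at each visited state.
print(np.argmax(q_online[0]), np.argmax(q_online[2]), np.argmax(q_online[3]))
```

The conservatism weight `alpha` controls how strongly out-of-data actions are suppressed; the paper's Guider Network then relaxes this pessimism online by steering exploration toward actions with high predicted reward, a component omitted from this offline-only sketch.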
Source journal: Transportation Research Part C: Emerging Technologies
CiteScore: 15.80 · Self-citation rate: 12.00% · Annual publications: 332 · Review time: 64 days
Journal description: Transportation Research: Part C (TR_C) is dedicated to showcasing high-quality, scholarly research that delves into the development, applications, and implications of transportation systems and emerging technologies. Our focus lies not solely on individual technologies, but rather on their broader implications for the planning, design, operation, control, maintenance, and rehabilitation of transportation systems, services, and components. In essence, the intellectual core of the journal revolves around the transportation aspect rather than the technology itself. We actively encourage the integration of quantitative methods from diverse fields such as operations research, control systems, complex networks, computer science, and artificial intelligence. Join us in exploring the intersection of transportation systems and emerging technologies to drive innovation and progress in the field.