Offline Reinforcement Learning for Learning to Dispatch for Job Shop Scheduling

arXiv - CS - Machine Learning | Pub Date: 2024-09-16 | DOI: arxiv-2409.10589

Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang



Abstract

The Job Shop Scheduling Problem (JSSP) is a complex combinatorial optimization problem. There has been growing interest in using online Reinforcement Learning (RL) for JSSP. While online RL can quickly find acceptable solutions, especially for larger problems, it produces lower-quality results than traditional methods such as Constraint Programming (CP). A significant downside of online RL is that it cannot learn from existing data, such as solutions generated by CP. Online RL agents must therefore be trained from scratch, which makes them sample-inefficient and unable to learn from higher-quality examples. We introduce Offline Reinforcement Learning for Learning to Dispatch (Offline-LD), a novel approach for JSSP that addresses these limitations. Offline-LD adapts two CQL-based Q-learning methods (mQRDQN and discrete mSAC) for maskable action spaces, introduces a new entropy bonus modification for discrete SAC, and exploits reward normalization through preprocessing. Our experiments show that Offline-LD outperforms online RL on both generated and benchmark instances. By introducing noise into the dataset, we achieve results similar to or better than those obtained from the expert dataset, indicating that a more diverse training set is preferable because it contains counterfactual information.
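The abstract does not give the loss used by Offline-LD, so the following is only a minimal sketch of how a CQL-style conservative penalty might be restricted to a maskable action space. It assumes the common discrete-action CQL regularizer (log-sum-exp of Q over actions minus Q of the logged action) and simply drops infeasible actions from the log-sum-exp. The function name `masked_cql_penalty` and its arguments are hypothetical and are not taken from the paper.

```python
import math


def masked_cql_penalty(q_values, action_mask, dataset_actions):
    """Mean of logsumexp(Q over feasible actions) - Q(logged action).

    q_values:        list of rows, one Q-estimate per action
    action_mask:     list of rows of bools, True where the action is feasible
    dataset_actions: list of action indices stored in the offline dataset
    All names are illustrative assumptions, not the paper's API.
    """
    total = 0.0
    for q_row, mask_row, action in zip(q_values, action_mask, dataset_actions):
        # Keep only Q-values of actions that are feasible in this state.
        valid_q = [q for q, ok in zip(q_row, mask_row) if ok]
        if not valid_q:
            raise ValueError("each state needs at least one feasible action")
        # Numerically stable log-sum-exp over the feasible actions.
        q_max = max(valid_q)
        lse = q_max + math.log(sum(math.exp(q - q_max) for q in valid_q))
        total += lse - q_row[action]
    return total / len(dataset_actions)


# One state with three actions; the third is infeasible and is ignored,
# even though its Q-estimate is very large.
q = [[1.0, 2.0, 100.0]]
mask = [[True, True, False]]
penalty = masked_cql_penalty(q, mask, [1])
print(penalty)
```

In this sketch the masked action never enters the penalty, so an overestimated Q-value for an operation that cannot be dispatched does not distort the regularizer. In a full training loop this term would typically be scaled by a weight and added to the usual temporal-difference loss.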


