QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds

Igor Halperin

Journal: The Journal of Derivatives
Publication date: 2020-04-25
DOI: https://doi.org/10.3905/jod.2020.1.108

Abstract

This article presents a discrete-time option pricing model that is rooted in reinforcement learning (RL), and more specifically in the famous Q-Learning method of RL. We construct a risk-adjusted Markov Decision Process for a discrete-time version of the classical Black-Scholes-Merton (BSM) model, where the option price is an optimal Q-function, while the optimal hedge is a second argument of this optimal Q-function, so that both the price and the hedge are parts of the same formula. Pricing is done by learning to dynamically optimize risk-adjusted returns for an option replicating portfolio, as in Markowitz portfolio theory.

Using Q-Learning and related methods, once created in a parametric setting, the model can go model-free and learn to price and hedge an option directly from data, without an explicit model of the world. This suggests that RL may provide efficient data-driven and model-free methods for the optimal pricing and hedging of options once we depart from the academic continuous-time limit. Conversely, option pricing methods developed in Mathematical Finance may be viewed as special cases of model-based reinforcement learning.

Further, due to the simplicity and tractability of our model, which only needs basic linear algebra (plus Monte Carlo simulation, if we work with synthetic data), and its close relationship to the original BSM model, we suggest that our model could be used in the benchmarking of different RL algorithms for financial trading applications.

TOPICS: Derivatives, options

Key Findings

• Reinforcement learning (RL) is the most natural way to price and hedge options when relying directly on data rather than on a specific model of asset pricing.
• The discrete-time RL approach to option pricing generalizes classical continuous-time methods; enables tracking mis-hedging risk, which disappears in the formal continuous-time limit; and provides a consistent framework for using options for both hedging and speculation.
• A simple quadratic reward function, which presents a minimal extension of the classical Black-Scholes framework when combined with the Q-learning method of RL, gives rise to a particularly simple computational scheme where option pricing and hedging are semianalytical, as they amount to multiple uses of a conventional least-squares regression.
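
To make the "multiple uses of a conventional least-squares regression" concrete, here is a minimal sketch of a backward recursion on synthetic Monte Carlo paths. It is not the article's algorithm: it assumes zero risk aversion (so the price is simply the mean of the hedged portfolio), a simulated drift equal to the risk-free rate, a European put, and a cubic polynomial basis. The function names (`simulate_gbm`, `ls_hedge_portfolio`, `bs_put`) and all parameter values are hypothetical choices for illustration.

```python
import numpy as np
from math import erf, exp, log, sqrt


def simulate_gbm(s0, mu, sigma, T, n_steps, n_paths, rng):
    """Simulate geometric Brownian motion paths, shape (n_paths, n_steps + 1)."""
    dt = T / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    log_inc = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_paths = np.concatenate(
        [np.zeros((n_paths, 1)), np.cumsum(log_inc, axis=1)], axis=1
    )
    return s0 * np.exp(log_paths)


def ls_hedge_portfolio(paths, strike, r, T, degree=3):
    """Roll a replicating portfolio backward, one least-squares fit per step.

    At each step the next-period portfolio value is regressed on basis
    functions of the stock price (intercept part) and on the same basis
    multiplied by the excess price move (hedge part). The hedge part is
    the variance-minimizing position under the stated assumptions.
    """
    n_steps = paths.shape[1] - 1
    growth = np.exp(r * T / n_steps)
    portfolio = np.maximum(strike - paths[:, -1], 0.0)  # put payoff at maturity
    for t in range(n_steps - 1, -1, -1):
        s_t = paths[:, t]
        ds = paths[:, t + 1] - growth * s_t          # excess price move
        x = s_t / paths[0, 0] - 1.0                  # rescaled state variable
        basis = np.vander(x, degree + 1, increasing=True)
        design = np.hstack([basis, basis * ds[:, None]])
        coef, *_ = np.linalg.lstsq(design, portfolio, rcond=None)
        hedge = basis @ coef[degree + 1:]            # fitted hedge per path
        portfolio = (portfolio - hedge * ds) / growth
    return portfolio                                 # time-0 values per path


def bs_put(s0, strike, r, sigma, T):
    """Closed-form Black-Scholes put price, used only as a reference."""
    d1 = (log(s0 / strike) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    cdf = lambda v: 0.5 * (1.0 + erf(v / sqrt(2.0)))
    return strike * exp(-r * T) * cdf(-d2) - s0 * cdf(-d1)


# Illustrative parameters (assumptions, not taken from the article).
s0, strike, r, sigma, T = 100.0, 100.0, 0.03, 0.2, 1.0
rng = np.random.default_rng(0)
paths = simulate_gbm(s0, r, sigma, T, n_steps=12, n_paths=20000, rng=rng)

hedged = ls_hedge_portfolio(paths, strike, r, T)
unhedged = np.exp(-r * T) * np.maximum(strike - paths[:, -1], 0.0)
price = hedged.mean()
reference = bs_put(s0, strike, r, sigma, T)

print("least-squares price estimate:", price)
print("Black-Scholes reference:     ", reference)
print("std of hedged vs unhedged:   ", hedged.std(), unhedged.std())
```

The spread of `hedged` around its mean is the mis-hedging risk that remains in discrete time. A nonzero risk-aversion parameter, as in the article's risk-adjusted setting, would add a premium for that residual variance; this sketch leaves that out.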