Compression of Optimal Value Functions for Markov Decision Processes

Mykel J. Kochenderfer, Nicholas Monath
2013 Data Compression Conference · DOI: 10.1109/DCC.2013.81 · Published 2013-03-01 · Citations: 10

Abstract

Summary form only given. A Markov decision process (MDP) is defined by a state space, action space, transition model, and reward model. The objective is to maximize the accumulation of reward over time. Solutions can be found through dynamic programming, which generally involves discretization, resulting in significant memory and computational requirements. Although computer clusters can be used to solve large problems, many applications require that solutions be executed on less capable hardware. We explored a general method for compressing solutions in a way that preserves fast random-access lookups. The method was applied to an MDP for an aircraft collision avoidance system. In our problem, S consists of aircraft positions and velocities and A consists of resolution advisories provided by the collision avoidance system, with |S| > 1.5 × 10^6 and |A| = 10. The solution to an MDP can be represented by an |S| × |A| matrix specifying Q*(s, a), the expected return of the optimal strategy from s after executing action a. Since, on average, only 6.6 actions are available from each state in our problem, it is more efficient to use a sparse representation consisting of an array of the permissible values of Q*, organized into variable-length blocks with one block per state. An index provides offsets into this Q* array corresponding to the block boundaries, and an action array lists the actions available from each state. The values of Q* are stored using a 32-bit floating-point representation, resulting in 534 MB for the three arrays associated with the sparse representation. Our method first converts to a 16-bit half-precision representation, sorts the state-action values within each block, adjusts the action array accordingly, and then removes redundant blocks. Although LZMA has a better compression ratio, it does not support real-time random-access decompression.
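The sparse representation described in the abstract — an offset index, an action array, and a flattened Q* array — can be sketched as follows. The toy data, array sizes, and function name are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the sparse block layout (illustrative toy data, 3 states):
#   offsets[s]  -> start of state s's block in the actions / q_values arrays
#   actions     -> permissible action indices, flattened across all states
#   q_values    -> Q*(s, a) for each permissible (s, a) pair, in block order

offsets = [0, 2, 5, 6]           # state s owns entries offsets[s] .. offsets[s+1]-1
actions = [0, 3,  1, 2, 4,  0]   # available actions per state, flattened
q_values = [1.5, 0.2,  0.9, 0.7, -0.4,  2.0]

def q_lookup(s, a):
    """Random-access lookup of Q*(s, a); None if a is not permissible in s."""
    start, end = offsets[s], offsets[s + 1]
    for i in range(start, end):
        if actions[i] == a:
            return q_values[i]
    return None  # action a is unavailable from state s

print(q_lookup(1, 2))  # -> 0.7
print(q_lookup(0, 1))  # -> None
```

Lookups stay O(block size) with at most |A| = 10 entries per block, which is what lets the compressed table serve real-time queries on modest hardware.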
The behavior of the proposed method was demonstrated in simulation with negligible impact on safety and operational performance metrics. The compression methodology was also demonstrated on related MDPs with similar compression ratios; further work will apply this technique to other domains.
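The three compression steps the abstract describes — half-precision conversion, sorting state-action values within each block, and removing redundant blocks — might be sketched roughly as below. This is a simplification under an assumed data layout (blocks held as small Python lists); the paper's own arrays and indexing scheme differ. Python's `struct` format `'e'` gives IEEE 754 binary16 rounding.

```python
import struct

def to_half(x):
    """Round a float to 16-bit half precision (IEEE 754 binary16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def compress_blocks(blocks):
    """blocks: one (actions, q_values) pair per state.
    Returns the deduplicated block table and a per-state index into it."""
    table, index, seen = [], [], {}
    for acts, qs in blocks:
        # 1) quantize to half precision; 2) co-sort values with their actions
        #    (this is the "adjusts the action array" step)
        pairs = sorted(zip((to_half(q) for q in qs), acts))
        key = tuple(pairs)
        # 3) states whose sorted blocks coincide share a single copy
        if key not in seen:
            seen[key] = len(table)
            table.append(pairs)
        index.append(seen[key])
    return table, index

blocks = [([0, 3], [0.25, 1.0]),
          ([3, 0], [1.0, 0.25]),   # same values, different order -> shared block
          ([1], [0.5])]
table, idx = compress_blocks(blocks)
print(len(table), idx)  # -> 2 [0, 0, 1]
```

Sorting within a block makes blocks that contain the same value set identical regardless of action order, which is what exposes the redundancy that the final deduplication step removes.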