Weber-Fechner law in temporal difference learning derived from control as inference.

IF 3 Q2 ROBOTICS

Frontiers in Robotics and AI Pub Date : 2025-09-25 eCollection Date: 2025-01-01 DOI:10.3389/frobt.2025.1649154

Keiichiro Takahashi, Taisuke Kobayashi, Tomoya Yamanokuchi, Takamitsu Matsubara

{"title":"Weber-Fechner law in temporal difference learning derived from control as inference.","authors":"Keiichiro Takahashi, Taisuke Kobayashi, Tomoya Yamanokuchi, Takamitsu Matsubara","doi":"10.3389/frobt.2025.1649154","DOIUrl":null,"url":null,"abstract":"This study investigates a novel nonlinear update rule for value and policy functions based on temporal difference (TD) errors in reinforcement learning (RL). The update rule in standard RL states that the TD error is linearly proportional to the degree of updates, treating all rewards equally without any bias. On the other hand, recent biological studies have revealed that there are nonlinearities in the TD error and the degree of updates, biasing policies towards being either optimistic or pessimistic. Such biases in learning due to nonlinearities are expected to be useful and intentionally leftover features in biological learning. Therefore, this research explores a theoretical framework that can leverage the nonlinearity between the degree of the update and TD errors. To this end, we focus on a control as inference framework utilized in the previous work, in which the uncomputable nonlinear term needed to be approximately excluded from the derivation of the standard RL. By analyzing it, the Weber-Fechner law (WFL) is found, in which perception (i.e., the degree of updates) in response to a change in stimulus (i.e., TD error) is attenuated as the stimulus intensity (i.e., the value function) increases. To numerically demonstrate the utilities of WFL on RL, we propose a practical implementation using a reward-punishment framework and modify the definition of optimality. Further analysis of this implementation reveals that two utilities can be expected: i) to accelerate escaping from the situations with small rewards and ii) to pursue the minimum punishment as much as possible. We finally investigate and discuss the expected utilities through simulations and robot experiments. As a result, the proposed RL algorithm with WFL shows the expected utilities that accelerate the reward-maximizing startup and continue to suppress punishments during learning.","PeriodicalId":47597,"journal":{"name":"Frontiers in Robotics and AI","volume":"12 ","pages":"1649154"},"PeriodicalIF":3.0000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12507549/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Robotics and AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frobt.2025.1649154","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}

引用次数: 0

Abstract

This study investigates a novel nonlinear update rule for value and policy functions based on temporal difference (TD) errors in reinforcement learning (RL). The update rule in standard RL states that the TD error is linearly proportional to the degree of updates, treating all rewards equally without any bias. On the other hand, recent biological studies have revealed that there are nonlinearities in the TD error and the degree of updates, biasing policies towards being either optimistic or pessimistic. Such biases in learning due to nonlinearities are expected to be useful and intentionally leftover features in biological learning. Therefore, this research explores a theoretical framework that can leverage the nonlinearity between the degree of the update and TD errors. To this end, we focus on a control as inference framework utilized in the previous work, in which the uncomputable nonlinear term needed to be approximately excluded from the derivation of the standard RL. By analyzing it, the Weber-Fechner law (WFL) is found, in which perception (i.e., the degree of updates) in response to a change in stimulus (i.e., TD error) is attenuated as the stimulus intensity (i.e., the value function) increases. To numerically demonstrate the utilities of WFL on RL, we propose a practical implementation using a reward-punishment framework and modify the definition of optimality. Further analysis of this implementation reveals that two utilities can be expected: i) to accelerate escaping from the situations with small rewards and ii) to pursue the minimum punishment as much as possible. We finally investigate and discuss the expected utilities through simulations and robot experiments. As a result, the proposed RL algorithm with WFL shows the expected utilities that accelerate the reward-maximizing startup and continue to suppress punishments during learning.

Abstract Image

查看原文本刊更多论文

时间差异学习中的韦伯-费希纳定律由控制作为推理推导而来。

研究了强化学习中基于时间差误差的值函数和策略函数的非线性更新规则。标准强化学习中的更新规则表明，TD误差与更新程度成线性比例，平等地对待所有奖励，没有任何偏见。另一方面，最近的生物学研究表明，在TD误差和更新程度上存在非线性，使政策倾向于乐观或悲观。由于非线性导致的这种学习偏差被认为是有用的，并且有意在生物学习中留下特征。因此，本研究探索了一个可以利用更新程度和TD误差之间的非线性的理论框架。为此，我们将重点放在前面工作中使用的控制作为推理框架上，其中不可计算的非线性项需要从标准RL的推导中近似地排除。通过分析，发现Weber-Fechner定律（WFL），即响应刺激（即TD误差）变化的感知（即更新程度）随着刺激强度（即值函数）的增加而衰减。为了在数值上证明WFL对RL的效用，我们提出了一个使用奖惩框架的实际实现，并修改了最优性的定义。对这种实现的进一步分析表明，可以预期两种效用：1)加速逃离奖励少的情况；2)尽可能追求最小的惩罚。最后通过仿真和机器人实验对预期效用进行了调查和讨论。因此，提出的带WFL的强化学习算法显示了预期的实用程序，加速了奖励最大化的启动，并在学习过程中继续抑制惩罚。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Frontiers in Robotics and AI ROBOTICS-

CiteScore

6.50

自引率

5.90%

发文量

355

审稿时长

14 weeks

期刊介绍： Frontiers in Robotics and AI publishes rigorously peer-reviewed research covering all theory and applications of robotics, technology, and artificial intelligence, from biomedical to space robotics.