通过包括直接和间接途径来增强强化学习模型可以提高纹状体依赖任务的表现。

IF 3.6 2区生物学

PLoS Computational Biology Pub Date : 2023-08-18 eCollection Date: 2023-08-01 DOI:10.1371/journal.pcbi.1011385

Kim T Blackwell, Kenji Doya

{"title":"通过包括直接和间接途径来增强强化学习模型可以提高纹状体依赖任务的表现。","authors":"Kim T Blackwell, Kenji Doya","doi":"10.1371/journal.pcbi.1011385","DOIUrl":null,"url":null,"abstract":"A major advance in understanding learning behavior stems from experiments showing that reward learning requires dopamine inputs to striatal neurons and arises from synaptic plasticity of cortico-striatal synapses. Numerous reinforcement learning models mimic this dopamine-dependent synaptic plasticity by using the reward prediction error, which resembles dopamine neuron firing, to learn the best action in response to a set of cues. Though these models can explain many facets of behavior, reproducing some types of goal-directed behavior, such as renewal and reversal, require additional model components. Here we present a reinforcement learning model, TD2Q, which better corresponds to the basal ganglia with two Q matrices, one representing direct pathway neurons (G) and another representing indirect pathway neurons (N). Unlike previous two-Q architectures, a novel and critical aspect of TD2Q is to update the G and N matrices utilizing the temporal difference reward prediction error. A best action is selected for N and G using a softmax with a reward-dependent adaptive exploration parameter, and then differences are resolved using a second selection step applied to the two action probabilities. The model is tested on a range of multi-step tasks including extinction, renewal, discrimination; switching reward probability learning; and sequence learning. Simulations show that TD2Q produces behaviors similar to rodents in choice and sequence learning tasks, and that use of the temporal difference reward prediction error is required to learn multi-step tasks. Blocking the update rule on the N matrix blocks discrimination learning, as observed experimentally. Performance in the sequence learning task is dramatically improved with two matrices. These results suggest that including additional aspects of basal ganglia physiology can improve the performance of reinforcement learning models, better reproduce animal behaviors, and provide insight as to the role of direct- and indirect-pathway striatal neurons.","PeriodicalId":49688,"journal":{"name":"PLoS Computational Biology","volume":"19 8","pages":"e1011385"},"PeriodicalIF":3.6000,"publicationDate":"2023-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10479916/pdf/","citationCount":"1","resultStr":"{\"title\":\"Enhancing reinforcement learning models by including direct and indirect pathways improves performance on striatal dependent tasks.\",\"authors\":\"Kim T Blackwell, Kenji Doya\",\"doi\":\"10.1371/journal.pcbi.1011385\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A major advance in understanding learning behavior stems from experiments showing that reward learning requires dopamine inputs to striatal neurons and arises from synaptic plasticity of cortico-striatal synapses. Numerous reinforcement learning models mimic this dopamine-dependent synaptic plasticity by using the reward prediction error, which resembles dopamine neuron firing, to learn the best action in response to a set of cues. Though these models can explain many facets of behavior, reproducing some types of goal-directed behavior, such as renewal and reversal, require additional model components. Here we present a reinforcement learning model, TD2Q, which better corresponds to the basal ganglia with two Q matrices, one representing direct pathway neurons (G) and another representing indirect pathway neurons (N). Unlike previous two-Q architectures, a novel and critical aspect of TD2Q is to update the G and N matrices utilizing the temporal difference reward prediction error. A best action is selected for N and G using a softmax with a reward-dependent adaptive exploration parameter, and then differences are resolved using a second selection step applied to the two action probabilities. The model is tested on a range of multi-step tasks including extinction, renewal, discrimination; switching reward probability learning; and sequence learning. Simulations show that TD2Q produces behaviors similar to rodents in choice and sequence learning tasks, and that use of the temporal difference reward prediction error is required to learn multi-step tasks. Blocking the update rule on the N matrix blocks discrimination learning, as observed experimentally. Performance in the sequence learning task is dramatically improved with two matrices. These results suggest that including additional aspects of basal ganglia physiology can improve the performance of reinforcement learning models, better reproduce animal behaviors, and provide insight as to the role of direct- and indirect-pathway striatal neurons.\",\"PeriodicalId\":49688,\"journal\":{\"name\":\"PLoS Computational Biology\",\"volume\":\"19 8\",\"pages\":\"e1011385\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2023-08-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10479916/pdf/\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"PLoS Computational Biology\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1371/journal.pcbi.1011385\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2023/8/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLoS Computational Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1371/journal.pcbi.1011385","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2023/8/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

在理解学习行为方面的一个重大进展源于实验，该实验表明，奖励学习需要多巴胺输入到纹状体神经元，并源于皮质-纹状体突触的突触可塑性。许多强化学习模型通过使用类似于多巴胺神经元放电的奖励预测误差来学习对一组线索的最佳反应，从而模拟这种多巴胺依赖性突触可塑性。尽管这些模型可以解释行为的许多方面，但复制某些类型的目标导向行为，如更新和反转，需要额外的模型组件。在这里，我们提出了一个强化学习模型TD2Q，它更好地对应于具有两个Q矩阵的基底神经节，一个表示直接通路神经元（G），另一个表示间接通路神经元（N）。与之前的两个Q架构不同，TD2Q的一个新颖而关键的方面是利用时间差奖励预测误差来更新G和N矩阵。使用具有依赖于奖励的自适应探索参数的softmax为N和G选择最佳动作，然后使用应用于两个动作概率的第二选择步骤来解决差异。该模型在一系列多步骤任务中进行了测试，包括灭绝、更新、歧视；切换奖励概率学习；以及序列学习。仿真表明，TD2Q在选择和序列学习任务中产生类似啮齿动物的行为，并且需要使用时间差奖励预测误差来学习多步骤任务。如实验所观察到的，在N个矩阵上阻止更新规则会阻止区分学习。使用两个矩阵可以显著提高序列学习任务的性能。这些结果表明，包括基底节生理学的其他方面可以提高强化学习模型的性能，更好地再现动物行为，并深入了解直接和间接途径纹状体神经元的作用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Enhancing reinforcement learning models by including direct and indirect pathways improves performance on striatal dependent tasks.

查看原文本刊更多论文

Enhancing reinforcement learning models by including direct and indirect pathways improves performance on striatal dependent tasks.

A major advance in understanding learning behavior stems from experiments showing that reward learning requires dopamine inputs to striatal neurons and arises from synaptic plasticity of cortico-striatal synapses. Numerous reinforcement learning models mimic this dopamine-dependent synaptic plasticity by using the reward prediction error, which resembles dopamine neuron firing, to learn the best action in response to a set of cues. Though these models can explain many facets of behavior, reproducing some types of goal-directed behavior, such as renewal and reversal, require additional model components. Here we present a reinforcement learning model, TD2Q, which better corresponds to the basal ganglia with two Q matrices, one representing direct pathway neurons (G) and another representing indirect pathway neurons (N). Unlike previous two-Q architectures, a novel and critical aspect of TD2Q is to update the G and N matrices utilizing the temporal difference reward prediction error. A best action is selected for N and G using a softmax with a reward-dependent adaptive exploration parameter, and then differences are resolved using a second selection step applied to the two action probabilities. The model is tested on a range of multi-step tasks including extinction, renewal, discrimination; switching reward probability learning; and sequence learning. Simulations show that TD2Q produces behaviors similar to rodents in choice and sequence learning tasks, and that use of the temporal difference reward prediction error is required to learn multi-step tasks. Blocking the update rule on the N matrix blocks discrimination learning, as observed experimentally. Performance in the sequence learning task is dramatically improved with two matrices. These results suggest that including additional aspects of basal ganglia physiology can improve the performance of reinforcement learning models, better reproduce animal behaviors, and provide insight as to the role of direct- and indirect-pathway striatal neurons.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

PLoS Computational Biology 生物-生化研究方法

CiteScore

7.10

自引率

4.70%

发文量

820

期刊介绍： PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales—from molecules and cells, to patient populations and ecosystems—through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.