A hardwired neural circuit for temporal difference learning.

Malcolm G Campbell, Yongsoo Ra, Zhiqin Chen, Shudi Xu, Mark Burrell, Sara Matias, Mitsuko Watabe-Uchida, Naoshige Uchida
DOI: 10.1101/2025.09.18.677203
Journal: bioRxiv : the preprint server for biology
Published: 2025-10-02
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12458361/pdf/
Citation count: 0

Abstract


The neurotransmitter dopamine plays a major role in learning by acting as a teaching signal to update the brain's predictions about rewards. A leading theory proposes that this process is analogous to a reinforcement learning algorithm called temporal difference (TD) learning, and that dopamine acts as the error term within the TD algorithm (TD error). Although many studies have demonstrated similarities between dopamine activity and TD errors (refs. 1-5), the mechanistic basis for dopaminergic TD learning remains unknown. Here, we combined large-scale neural recordings with patterned optogenetic stimulation to examine whether and how the key steps in TD learning are accomplished by the circuitry connecting dopamine neurons and their targets. Replacing natural rewards with optogenetic stimulation of dopamine axons in the nucleus accumbens (NAc) in a classical conditioning task gradually generated TD error-like activity patterns in dopamine neurons by specifically modifying the task-related activity of NAc neurons expressing the D1 dopamine receptor (D1 neurons). In turn, patterned optogenetic stimulation of NAc D1 neurons in naïve animals drove dopamine neuron spiking according to the TD error of the stimulation pattern, indicating that TD computations are hardwired into this circuit. The transformation from D1 neurons to dopamine neurons could be described by a biphasic linear filter, with a rapid positive and delayed negative phase, that effectively computes a temporal difference. This finding suggests that the time horizon over which the TD algorithm operates, the temporal discount factor, is set by the balance of the positive and negative components of the linear filter, pointing to a circuit-level mechanism for temporal discounting. These results provide a new conceptual framework for understanding how the computations and parameters governing animal learning arise from neurobiological components.
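The two computational ideas in the abstract, the TD error and a biphasic filter whose positive/negative balance sets the discount factor, can be illustrated with a minimal toy sketch. This is not the paper's fitted model: the value signal, reward timing, discount factor γ, and one-step delay of the negative lobe are all assumptions chosen for illustration. It shows that the standard TD error δ_t = r_t + γV(s_{t+1}) − V(s_t) can equivalently be obtained by convolving the value signal with a biphasic kernel (fast positive lobe, delayed negative lobe), with γ given by the ratio of the two lobes' amplitudes.

```python
import numpy as np

# Illustrative sketch only (assumed values, not the paper's fitted filter):
# a biphasic kernel -- fast positive lobe followed by a delayed negative
# lobe -- applied to a value signal computes the temporal difference
# gamma*V[t+1] - V[t].

gamma = 0.9   # temporal discount factor (assumption for illustration)
delay = 1     # lag of the negative lobe, in time steps (assumption)

# The positive/negative amplitude ratio of the kernel sets gamma:
# kernel[0] / (-kernel[delay]) == gamma
kernel = np.zeros(delay + 1)
kernel[0] = gamma     # rapid positive phase
kernel[delay] = -1.0  # delayed negative phase

# Toy trial: value ramps up toward a fully predicted reward at t = 4.
V = np.array([0.0, 0.1, 0.3, 0.6, 1.0, 0.0])  # value estimate per step
r = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0])  # reward delivered at t = 4

# TD error from its standard definition: delta_t = r_t + gamma*V_{t+1} - V_t
delta_direct = r[:-1] + gamma * V[1:] - V[:-1]

# The same temporal-difference term via linear filtering of V:
# np.convolve(V, kernel)[n] = gamma*V[n] - V[n-1], so the slice below
# aligns it with gamma*V[t+1] - V[t].
td_term = np.convolve(V, kernel)[1:len(V)]
delta_filtered = r[:-1] + td_term

print(delta_direct)    # small positive errors while value ramps up;
print(delta_filtered)  # zero at the fully predicted reward; identical arrays
```

Note that the TD error vanishes at the reward itself (it is fully predicted by the value ramp), while small positive errors appear as the prediction builds, which is the qualitative signature the abstract attributes to dopamine activity.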
