A survey of slow thinking-based reasoning LLMs using reinforcement learning and test-time scaling law

IF 6.9 1区 管理学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Qianjun Pan , Wenkai Ji , Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He
{"title":"A survey of slow thinking-based reasoning LLMs using reinforcement learning and test-time scaling law","authors":"Qianjun Pan ,&nbsp;Wenkai Ji ,&nbsp;Yuyang Ding,&nbsp;Junsong Li,&nbsp;Shilian Chen,&nbsp;Junyi Wang,&nbsp;Jie Zhou,&nbsp;Qin Chen,&nbsp;Min Zhang,&nbsp;Yulan Wu,&nbsp;Liang He","doi":"10.1016/j.ipm.2025.104394","DOIUrl":null,"url":null,"abstract":"<div><div>This survey presents a focused and conceptually distinct framework for understanding recent advancements in reasoning large language models (LLMs) designed to emulate “slow thinking”, a deliberate, analytical mode of cognition analogous to System 2 in dual-process theory from cognitive psychology. While prior review works have surveyed reasoning LLMs through fragmented lenses, such as isolated technical paradigms (e.g., reinforcement learning or test-time scaling) or broad post-training taxonomies, this work uniquely integrates reinforcement learning and test-time scaling as synergistic mechanisms within a unified “slow thinking” paradigm. By synthesizing insights from over 200 studies, we identify three interdependent pillars that collectively enable advanced reasoning: (1) Test-time scaling, which dynamically allocates computational resources based on task complexity via search, adaptive computation, and verification; (2) Reinforcement learning, which refines reasoning trajectories through reward modeling, policy optimization, and self-improvement; and (3) Slow-thinking frameworks, which structure reasoning into stepwise, hierarchical, or hybrid processes such as long Chain-of-Thought and multi-agent deliberation. Unlike existing surveys, our framework is goal-oriented, centering on the cognitive objective of “slow thinking” as both a unifying principle and a design imperative. This perspective enables a systematic analysis of how diverse techniques converge toward human-like deep reasoning. The survey charts a trajectory toward next-generation LLMs that balance cognitive fidelity with computational efficiency, while also outlining key challenges and future directions. Advancing such reasoning capabilities is essential for deploying LLMs in high-stakes domains including scientific discovery, autonomous agents, and complex decision support systems.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"63 2","pages":"Article 104394"},"PeriodicalIF":6.9000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325003358","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

This survey presents a focused and conceptually distinct framework for understanding recent advancements in reasoning large language models (LLMs) designed to emulate “slow thinking”, a deliberate, analytical mode of cognition analogous to System 2 in dual-process theory from cognitive psychology. While prior review works have surveyed reasoning LLMs through fragmented lenses, such as isolated technical paradigms (e.g., reinforcement learning or test-time scaling) or broad post-training taxonomies, this work uniquely integrates reinforcement learning and test-time scaling as synergistic mechanisms within a unified “slow thinking” paradigm. By synthesizing insights from over 200 studies, we identify three interdependent pillars that collectively enable advanced reasoning: (1) Test-time scaling, which dynamically allocates computational resources based on task complexity via search, adaptive computation, and verification; (2) Reinforcement learning, which refines reasoning trajectories through reward modeling, policy optimization, and self-improvement; and (3) Slow-thinking frameworks, which structure reasoning into stepwise, hierarchical, or hybrid processes such as long Chain-of-Thought and multi-agent deliberation. Unlike existing surveys, our framework is goal-oriented, centering on the cognitive objective of “slow thinking” as both a unifying principle and a design imperative. This perspective enables a systematic analysis of how diverse techniques converge toward human-like deep reasoning. The survey charts a trajectory toward next-generation LLMs that balance cognitive fidelity with computational efficiency, while also outlining key challenges and future directions. Advancing such reasoning capabilities is essential for deploying LLMs in high-stakes domains including scientific discovery, autonomous agents, and complex decision support systems.
基于强化学习和测试时间标度律的慢思维推理llm研究
本研究为理解大型语言模型(llm)的最新进展提供了一个重点和概念上独特的框架,该模型旨在模仿“慢思维”,这是一种类似于认知心理学双过程理论中的系统2的深思熟虑的分析式认知模式。虽然之前的审查工作已经通过碎片化的镜头调查了推理法学硕士,例如孤立的技术范式(例如,强化学习或测试时间缩放)或广泛的训练后分类,但这项工作独特地将强化学习和测试时间缩放作为统一的“慢思维”范式中的协同机制集成在一起。通过综合来自200多个研究的见解,我们确定了三个相互依存的支柱,共同实现高级推理:(1)测试时间缩放,它通过搜索、自适应计算和验证,根据任务复杂性动态分配计算资源;(2)强化学习,通过奖励建模、策略优化和自我完善来细化推理轨迹;(3)慢思维框架,它将推理结构成逐步的、分层的或混合的过程,如长思维链和多智能体审议。与现有的调查不同,我们的框架是目标导向的,以“慢思维”的认知目标为中心,作为统一原则和设计要求。这一观点使我们能够系统地分析不同的技术是如何向类似人类的深度推理融合的。该调查描绘了下一代法学硕士在认知保真度和计算效率之间取得平衡的轨迹,同时也概述了主要挑战和未来方向。推进这种推理能力对于在高风险领域(包括科学发现、自主代理和复杂决策支持系统)部署法学硕士至关重要。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Information Processing & Management
Information Processing & Management 工程技术-计算机:信息系统
CiteScore
17.00
自引率
11.60%
发文量
276
审稿时长
39 days
期刊介绍: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信