A survey of slow thinking-based reasoning LLMs using reinforcement learning and test-time scaling law

IF 6.9 1区管理学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Processing & Management Pub Date : 2025-09-12 DOI:10.1016/j.ipm.2025.104394

Qianjun Pan , Wenkai Ji , Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He

{"title":"A survey of slow thinking-based reasoning LLMs using reinforcement learning and test-time scaling law","authors":"Qianjun Pan , Wenkai Ji , Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He","doi":"10.1016/j.ipm.2025.104394","DOIUrl":null,"url":null,"abstract":"<div><div>This survey presents a focused and conceptually distinct framework for understanding recent advancements in reasoning large language models (LLMs) designed to emulate “slow thinking”, a deliberate, analytical mode of cognition analogous to System 2 in dual-process theory from cognitive psychology. While prior review works have surveyed reasoning LLMs through fragmented lenses, such as isolated technical paradigms (e.g., reinforcement learning or test-time scaling) or broad post-training taxonomies, this work uniquely integrates reinforcement learning and test-time scaling as synergistic mechanisms within a unified “slow thinking” paradigm. By synthesizing insights from over 200 studies, we identify three interdependent pillars that collectively enable advanced reasoning: (1) Test-time scaling, which dynamically allocates computational resources based on task complexity via search, adaptive computation, and verification; (2) Reinforcement learning, which refines reasoning trajectories through reward modeling, policy optimization, and self-improvement; and (3) Slow-thinking frameworks, which structure reasoning into stepwise, hierarchical, or hybrid processes such as long Chain-of-Thought and multi-agent deliberation. Unlike existing surveys, our framework is goal-oriented, centering on the cognitive objective of “slow thinking” as both a unifying principle and a design imperative. This perspective enables a systematic analysis of how diverse techniques converge toward human-like deep reasoning. The survey charts a trajectory toward next-generation LLMs that balance cognitive fidelity with computational efficiency, while also outlining key challenges and future directions. Advancing such reasoning capabilities is essential for deploying LLMs in high-stakes domains including scientific discovery, autonomous agents, and complex decision support systems.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"63 2","pages":"Article 104394"},"PeriodicalIF":6.9000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325003358","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

This survey presents a focused and conceptually distinct framework for understanding recent advancements in reasoning large language models (LLMs) designed to emulate “slow thinking”, a deliberate, analytical mode of cognition analogous to System 2 in dual-process theory from cognitive psychology. While prior review works have surveyed reasoning LLMs through fragmented lenses, such as isolated technical paradigms (e.g., reinforcement learning or test-time scaling) or broad post-training taxonomies, this work uniquely integrates reinforcement learning and test-time scaling as synergistic mechanisms within a unified “slow thinking” paradigm. By synthesizing insights from over 200 studies, we identify three interdependent pillars that collectively enable advanced reasoning: (1) Test-time scaling, which dynamically allocates computational resources based on task complexity via search, adaptive computation, and verification; (2) Reinforcement learning, which refines reasoning trajectories through reward modeling, policy optimization, and self-improvement; and (3) Slow-thinking frameworks, which structure reasoning into stepwise, hierarchical, or hybrid processes such as long Chain-of-Thought and multi-agent deliberation. Unlike existing surveys, our framework is goal-oriented, centering on the cognitive objective of “slow thinking” as both a unifying principle and a design imperative. This perspective enables a systematic analysis of how diverse techniques converge toward human-like deep reasoning. The survey charts a trajectory toward next-generation LLMs that balance cognitive fidelity with computational efficiency, while also outlining key challenges and future directions. Advancing such reasoning capabilities is essential for deploying LLMs in high-stakes domains including scientific discovery, autonomous agents, and complex decision support systems.

查看原文本刊更多论文

基于强化学习和测试时间标度律的慢思维推理llm研究

本研究为理解大型语言模型（llm）的最新进展提供了一个重点和概念上独特的框架，该模型旨在模仿“慢思维”，这是一种类似于认知心理学双过程理论中的系统2的深思熟虑的分析式认知模式。虽然之前的审查工作已经通过碎片化的镜头调查了推理法学硕士，例如孤立的技术范式（例如，强化学习或测试时间缩放）或广泛的训练后分类，但这项工作独特地将强化学习和测试时间缩放作为统一的“慢思维”范式中的协同机制集成在一起。通过综合来自200多个研究的见解，我们确定了三个相互依存的支柱，共同实现高级推理：(1)测试时间缩放，它通过搜索、自适应计算和验证，根据任务复杂性动态分配计算资源；(2)强化学习，通过奖励建模、策略优化和自我完善来细化推理轨迹；(3)慢思维框架，它将推理结构成逐步的、分层的或混合的过程，如长思维链和多智能体审议。与现有的调查不同，我们的框架是目标导向的，以“慢思维”的认知目标为中心，作为统一原则和设计要求。这一观点使我们能够系统地分析不同的技术是如何向类似人类的深度推理融合的。该调查描绘了下一代法学硕士在认知保真度和计算效率之间取得平衡的轨迹，同时也概述了主要挑战和未来方向。推进这种推理能力对于在高风险领域（包括科学发现、自主代理和复杂决策支持系统）部署法学硕士至关重要。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Information Processing & Management 工程技术-计算机：信息系统

CiteScore

17.00

自引率

11.60%

发文量

276

审稿时长

39 days

期刊介绍： Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.