{"title":"CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks","authors":"Tianlong Wang, Xueting Han, Jing Bai","doi":"arxiv-2409.08642","DOIUrl":null,"url":null,"abstract":"Post-training large language models (LLMs) to develop reasoning capabilities\nhas proven effective across diverse domains, such as mathematical reasoning and\ncode generation. However, existing methods primarily focus on improving\ntask-specific reasoning but have not adequately addressed the model's\ngeneralization capabilities across a broader range of reasoning tasks. To\ntackle this challenge, we introduce Critical Planning Step Learning (CPL),\nwhich leverages Monte Carlo Tree Search (MCTS) to explore diverse planning\nsteps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns\nstep-level planning preferences to improve the model's planning capabilities\nand, consequently, its general reasoning capabilities. Furthermore, while\neffective in many scenarios for aligning LLMs, existing preference learning\napproaches like Direct Preference Optimization (DPO) struggle with complex\nmulti-step reasoning tasks due to their inability to capture fine-grained\nsupervision at each step. We propose Step-level Advantage Preference\nOptimization (Step-APO), which integrates an advantage estimate for step-level\npreference pairs obtained via MCTS into the DPO. This enables the model to more\neffectively learn critical intermediate planning steps, thereby further\nimproving its generalization in reasoning tasks. Experimental results\ndemonstrate that our method, trained exclusively on GSM8K and MATH, not only\nsignificantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also\nenhances out-of-domain reasoning benchmarks, such as ARC-C (+4.0%), BBH\n(+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08642","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Post-training large language models (LLMs) to develop reasoning capabilities has proven effective across diverse domains, such as mathematical reasoning and code generation. However, existing methods primarily focus on improving task-specific reasoning but have not adequately addressed the model's generalization capabilities across a broader range of reasoning tasks. To tackle this challenge, we introduce Critical Planning Step Learning (CPL), which leverages Monte Carlo Tree Search (MCTS) to explore diverse planning steps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns step-level planning preferences to improve the model's planning capabilities and, consequently, its general reasoning capabilities.
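As a rough illustration of this search-and-preference-extraction idea (not the authors' implementation), the toy sketch below runs MCTS over partial plans and turns value gaps between sibling planning steps into step-level preference pairs with an advantage estimate. The helpers `propose_steps` and `rollout_value` are hypothetical stand-ins for the LLM policy and the long-term outcome signal (e.g., final-answer correctness).

```python
# Illustrative sketch only: toy MCTS over partial "plans" whose sibling value
# gaps become step-level preference pairs. propose_steps() and rollout_value()
# are hypothetical stand-ins, not part of the paper's code.
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    plan: tuple = ()                        # partial plan: planning steps chosen so far
    parent: "Node" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def value(self) -> float:               # mean long-term outcome observed below this step
        return self.value_sum / self.visits if self.visits else 0.0

def propose_steps(plan):                    # hypothetical: candidate next planning steps from an LLM
    return [f"step{len(plan)}-option{i}" for i in range(3)]

def rollout_value(plan):                    # hypothetical: long-term reward of completing this plan
    return random.random()

def select(node, c=1.4):                    # UCT selection down to a leaf
    while node.children:
        node = max(node.children,
                   key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)))
    return node

def mcts(root, n_sims=200):
    for _ in range(n_sims):
        leaf = select(root)
        if not leaf.children:               # expand with candidate next steps
            leaf.children = [Node(plan=leaf.plan + (s,), parent=leaf) for s in propose_steps(leaf.plan)]
        child = random.choice(leaf.children)
        reward = rollout_value(child.plan)  # simulate the long-term outcome
        while child is not None:            # backpropagate
            child.visits += 1
            child.value_sum += reward
            child = child.parent

def step_preference_pairs(node, pairs):
    """Sibling steps under the same prefix with different values -> (better, worse, advantage gap)."""
    kids = [ch for ch in node.children if ch.visits > 0]
    if len(kids) >= 2:
        best, worst = max(kids, key=lambda c: c.value), min(kids, key=lambda c: c.value)
        if best.value > worst.value:
            pairs.append((best.plan[-1], worst.plan[-1], best.value - worst.value))
    for ch in node.children:
        step_preference_pairs(ch, pairs)
    return pairs

root = Node()
mcts(root)
print(step_preference_pairs(root, [])[:3])
```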
Furthermore, while effective in many scenarios for aligning LLMs, existing preference learning approaches such as Direct Preference Optimization (DPO) struggle with complex multi-step reasoning tasks due to their inability to capture fine-grained supervision at each step. We propose Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step-level preference pairs, obtained via MCTS, into the DPO objective. This enables the model to learn critical intermediate planning steps more effectively, thereby further improving its generalization in reasoning tasks.
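The precise Step-APO objective is defined in the paper; as a hedged sketch of the idea described above, the snippet below weights a standard DPO-style log-ratio margin for each step-level pair by its MCTS-derived advantage gap, so that pairs with larger gaps (the "critical" steps) contribute more to the gradient. The function name and arguments are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: an advantage-weighted, step-level DPO-style loss.
# This is one plausible reading of the abstract, not the paper's exact objective.
import torch
import torch.nn.functional as F

def step_apo_loss(logp_chosen, logp_rejected,          # log pi_theta(step | prefix) for better / worse steps
                  ref_logp_chosen, ref_logp_rejected,  # same quantities under the frozen reference model
                  advantage_gap,                        # per-pair advantage estimate from MCTS (>= 0)
                  beta=0.1):
    # DPO-style implicit reward margin between the preferred and dispreferred planning step
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Scale each pair's contribution by its advantage gap
    return -(advantage_gap * F.logsigmoid(beta * margin)).mean()

# Toy usage: random numbers stand in for summed token log-probabilities of each step
torch.manual_seed(0)
n = 4
loss = step_apo_loss(torch.randn(n), torch.randn(n), torch.randn(n), torch.randn(n), torch.rand(n))
print(loss.item())
```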
Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%) but also enhances out-of-domain reasoning benchmarks such as ARC-C (+4.0%), BBH (+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).