{"title":"MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning","authors":"Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal","doi":"arxiv-2409.12147","DOIUrl":null,"url":null,"abstract":"Large Language Models' (LLM) reasoning can be improved using test-time\naggregation strategies, i.e., generating multiple samples and voting among\ngenerated samples. While these improve performance, they often reach a\nsaturation point. Refinement offers an alternative by using LLM-generated\nfeedback to improve solution quality. However, refinement introduces 3 key\nchallenges: (1) Excessive refinement: Uniformly refining all instances can\nover-correct and reduce the overall performance. (2) Inability to localize and\naddress errors: LLMs have a limited ability to self-correct and struggle to\nidentify and correct their own mistakes. (3) Insufficient refinement: Deciding\nhow many iterations of refinement are needed is non-trivial, and stopping too\nsoon could leave errors unaddressed. To tackle these issues, we propose\nMAgICoRe, which avoids excessive refinement by categorizing problem difficulty\nas easy or hard, solving easy problems with coarse-grained aggregation and hard\nones with fine-grained and iterative multi-agent refinement. To improve error\nlocalization, we incorporate external step-wise reward model (RM) scores.\nMoreover, to ensure effective refinement, we employ a multi-agent loop with\nthree agents: Solver, Reviewer (which generates targeted feedback based on\nstep-wise RM scores), and the Refiner (which incorporates feedback). To ensure\nsufficient refinement, we re-evaluate updated solutions, iteratively initiating\nfurther rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5\nand show its effectiveness across 5 math datasets. Even one iteration of\nMAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by\n4.0% while using less than half the samples. Unlike iterative refinement with\nbaselines, MAgICoRe continues to improve with more iterations. Finally, our\nablations highlight the importance of MAgICoRe's RMs and multi-agent\ncommunication.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12147","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large Language Model (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among the generated samples. While these strategies improve performance, they often reach a saturation point. Refinement offers an alternative: using LLM-generated feedback to improve solution quality. However, refinement introduces three key challenges: (1) Excessive refinement: uniformly refining all instances can over-correct and reduce overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and fix their own mistakes. (3) Insufficient refinement: deciding how many iterations of refinement are needed is non-trivial, and stopping too soon can leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problems as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained, iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi-agent loop with three agents: the Solver, the Reviewer (which generates targeted feedback based on step-wise RM scores), and the Refiner (which incorporates that feedback). To ensure sufficient refinement, we re-evaluate updated solutions, iteratively initiating further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across five math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0%, while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
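
The abstract describes a coarse-to-fine control flow: sample multiple candidates, route easy problems to coarse-grained aggregation, and send hard problems through an iterative Solver-Reviewer-Refiner loop guided by step-wise RM scores. The following is a minimal sketch of that control flow under stated assumptions; the callable names (solve, step_scores, review, refine), the use of the weakest step score as a difficulty proxy, and the threshold/round values are illustrative, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, List


def magicore_sketch(
    question: str,
    solve: Callable[[str], str],                     # Solver LLM: returns one candidate solution
    step_scores: Callable[[str, str], List[float]],  # step-wise reward model scores for a solution
    review: Callable[[str, str, List[float]], str],  # Reviewer: targeted feedback from RM scores
    refine: Callable[[str, str, str], str],          # Refiner: revises the solution using feedback
    k: int = 8,
    easy_threshold: float = 0.9,   # assumed difficulty cutoff (hypothetical)
    max_rounds: int = 3,
) -> str:
    # 1) Test-time sampling: generate k candidate solutions and score them.
    candidates = [solve(question) for _ in range(k)]
    scores = [step_scores(question, c) for c in candidates]
    confidences = [min(s) for s in scores]  # weakest step as a difficulty proxy (assumption)

    # 2) Easy problems: coarse-grained aggregation (here, a simple majority vote;
    #    in practice one would vote over extracted final answers) and stop.
    if max(confidences) >= easy_threshold:
        return Counter(candidates).most_common(1)[0][0]

    # 3) Hard problems: fine-grained, iterative multi-agent refinement.
    best_idx = max(range(k), key=lambda i: confidences[i])
    solution = candidates[best_idx]
    for _ in range(max_rounds):
        s = step_scores(question, solution)
        if min(s) >= easy_threshold:
            break                                          # re-evaluation says refinement suffices
        feedback = review(question, solution, s)           # Reviewer localizes low-scoring steps
        solution = refine(question, solution, feedback)    # Refiner incorporates the feedback
    return solution
```

The sketch exists only to make the routing logic concrete: aggregation handles instances the RM already rates highly (avoiding excessive refinement), while re-scoring after each Refiner pass decides whether another round is needed (avoiding insufficient refinement).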