Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
{"title":"To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning","authors":"Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett","doi":"arxiv-2409.12183","DOIUrl":null,"url":null,"abstract":"Chain-of-thought (CoT) via prompting is the de facto method for eliciting\nreasoning capabilities from large language models (LLMs). But for what kinds of\ntasks is this extra ``thinking'' really helpful? To analyze this, we conducted\na quantitative meta-analysis covering over 100 papers using CoT and ran our own\nevaluations of 20 datasets across 14 models. Our results show that CoT gives\nstrong performance benefits primarily on tasks involving math or logic, with\nmuch smaller gains on other types of tasks. On MMLU, directly generating the\nanswer without CoT leads to almost identical accuracy as CoT unless the\nquestion or model's response contains an equals sign, indicating symbolic\noperations and reasoning. Following this finding, we analyze the behavior of\nCoT on these problems by separating planning and execution and comparing\nagainst tool-augmented LLMs. Much of CoT's gain comes from improving symbolic\nexecution, but it underperforms relative to using a symbolic solver. Our\nresults indicate that CoT can be applied selectively, maintaining performance\nwhile saving inference costs. Furthermore, they suggest a need to move beyond\nprompt-based CoT to new paradigms that better leverage intermediate computation\nacross the whole range of LLM applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12183","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Chain-of-thought (CoT) via prompting is the de facto method for eliciting
reasoning capabilities from large language models (LLMs). But for what kinds of
tasks is this extra ``thinking'' really helpful? To analyze this, we conducted
a quantitative meta-analysis covering over 100 papers using CoT and ran our own
evaluations of 20 datasets across 14 models. Our results show that CoT gives
strong performance benefits primarily on tasks involving math or logic, with
much smaller gains on other types of tasks. On MMLU, directly generating the
answer without CoT leads to almost identical accuracy as CoT unless the
question or model's response contains an equals sign, indicating symbolic
operations and reasoning. Following this finding, we analyze the behavior of
CoT on these problems by separating planning and execution and comparing
against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic
execution, but it underperforms relative to using a symbolic solver. Our
results indicate that CoT can be applied selectively, maintaining performance
while saving inference costs. Furthermore, they suggest a need to move beyond
prompt-based CoT to new paradigms that better leverage intermediate computation
across the whole range of LLM applications.