Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

Seungpil Lee, Woochang Sim, Donghyeon Shin, Sanha Hwang, Wongyu Seo, Jiwon Park, Seokki Lee, Sejin Kim, Sundong Kim

arXiv - CS - Symbolic Computation, published 2024-03-18, DOI: arxiv-2403.11793
Citations: 0
Abstract
Existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been results-centric, making it difficult to assess the inference process itself. We introduce a new approach that uses the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual-understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates the comparison of model inference abilities with those of humans. Experimental results confirm that while large language models possess weak inference abilities, they still lag in terms of logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs and propose development paths toward human-level reasoning.
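To make concrete what kind of problem ARC poses, the sketch below shows a toy ARC-style task: a few input/output grid pairs demonstrate a hidden transformation rule, and a solver must infer the rule and apply it to a new input. The specific rule here (mirroring each row) is an invented illustration, not a task from the actual ARC dataset or from this paper.

```python
# Toy ARC-style task (illustrative only): grids are lists of rows of
# small integers ("colors"). The hidden rule in this made-up example
# is horizontal mirroring.

def mirror_horizontal(grid):
    """Apply the hidden rule: reflect each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Demonstration pairs, as an ARC task would present them.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0]], [[0, 5, 5]]),
]

# A solver is judged on whether it reproduces the correct output
# for an unseen test input after observing the demonstration pairs.
for inp, out in train_pairs:
    assert mirror_horizontal(inp) == out

test_input = [[7, 0, 0]]
print(mirror_horizontal(test_input))  # -> [[0, 0, 7]]
```

Evaluating the full chain of rule induction and application, rather than only the final grid, is what the abstract means by a process-centric assessment.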