Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Impact Factor: 8.9 · CAS Tier 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, Erik Cambria
DOI: 10.1109/TKDE.2025.3536008
Journal: IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 4, pp. 1620-1634
Publication date: 2025-02-03 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10870148/
Citations: 0

Abstract

Logical reasoning plays a fundamental and significant role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively address logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains an open question. To bridge this gap, this paper provides comprehensive evaluations. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive, and mixed-form reasoning settings. For comprehensiveness, we include three early-era representative LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-level objective and subjective evaluations covering both answers and explanations, including answer correctness, explain correctness, explain completeness, and explain redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases are attributed to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on these in-depth evaluations, the paper finally forms a general evaluation scheme of logical reasoning capability along six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented, and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future work.
Source Journal
IEEE Transactions on Knowledge and Data Engineering
Category: Engineering Technology, Engineering: Electrical & Electronic
CiteScore: 11.70
Self-citation rate: 3.40%
Articles per year: 515
Review time: 6 months
Journal description: The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.