LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI

IF 1.7 · CAS Tier 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Ishan Tarunesh, Somak Aditya, Monojit Choudhury
DOI: 10.1007/s10579-023-09691-y
Published: 2023-11-04, Language Resources and Evaluation (Journal Article)
Cited by: 4

Abstract

Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse Logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) design experiments to study cross-capability information content (leave one out or bring one in); and (3) the synthetic nature enables us to control for artifacts and biases. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models – supporting and extending previous observations; thus showing the utility of the proposed testbench.
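The abstract describes automated test-case instantiation from free-form natural language templates in the style of CheckList. As a minimal sketch of that idea, the snippet below fills slots in a premise/hypothesis template from a small lexicon to mass-produce labeled NLI test cases; the template syntax, slot names, and capability example are illustrative assumptions, not the paper's actual templates or lexicons.

```python
# Sketch of CheckList-style template instantiation for NLI test cases.
# Slot names, the example template, and the lexicon are hypothetical.
from itertools import product


def instantiate(template: dict, lexicon: dict) -> list:
    """Fill every {slot} in the premise/hypothesis with each lexicon value,
    yielding (premise, hypothesis, label) test cases for one capability."""
    text = template["premise"] + " " + template["hypothesis"]
    slots = sorted(s for s in lexicon if "{" + s + "}" in text)
    cases = []
    for values in product(*(lexicon[s] for s in slots)):
        filled = dict(zip(slots, values))
        cases.append((template["premise"].format(**filled),
                      template["hypothesis"].format(**filled),
                      template["label"]))
    return cases


# A hypothetical numerical-reasoning template whose label is fixed by design,
# so every instantiation is a valid labeled example.
template = {
    "premise": "{name} has {n} apples.",
    "hypothesis": "{name} has more than one apple.",
    "label": "entailment",
}
lexicon = {"name": ["John", "Mary"], "n": ["two", "three"]}
cases = instantiate(template, lexicon)
# 2 names x 2 number words -> 4 test cases, all labeled "entailment"
```

Because the label is determined at the template level, instantiation scales each template to many examples without per-example annotation, which is how 363 templates can yield on the order of 363k examples.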


Source journal: Language Resources and Evaluation (Engineering & Technology: Computer Science, Interdisciplinary Applications)
CiteScore: 6.50
Self-citation rate: 3.70%
Articles per year: 55
Review time: >12 weeks
Journal description: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.