How Well Do Multi-hop Reading Comprehension Models Understand Date Information?

Q3 Environmental Science

AACL Bioflux Pub Date : 2022-10-11 DOI:10.48550/arXiv.2210.05208

Xanh Ho, Saku Sugawara, Akiko Aizawa


Citations: 2

Abstract

Several multi-hop reading comprehension datasets have been proposed to resolve the issue of reasoning shortcuts by which questions can be answered without performing multi-hop reasoning. However, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear. It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems. To evaluate the model precisely in a hierarchical manner, we first propose a dataset, HieraDate, with three probing tasks in addition to the main question: extraction, reasoning, and robustness. Our dataset is created by enhancing two previous multi-hop datasets, HotpotQA and 2WikiMultiHopQA, focusing on multi-hop questions on date information that involve both comparison and numerical reasoning. We then evaluate the ability of existing models to understand date information. Our experimental results reveal that the multi-hop models do not have the ability to subtract two dates even when they perform well in date comparison and number subtraction tasks. Other results reveal that our probing questions can help to improve the performance of the models (e.g., by +10.3 F1) on the main QA task and our dataset can be used for data augmentation to improve the robustness of the models.
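The distinction the abstract draws between date comparison and date subtraction can be made concrete with a minimal sketch. The code below is an illustration only, not the HieraDate data format or the authors' evaluation code; the function names and the example dates are hypothetical, and it assumes that the dates have already been extracted from the context paragraphs.

```python
from datetime import date


def compare_dates(a: date, b: date) -> str:
    """Comparison step: say which of two dates comes first."""
    if a == b:
        return "same"
    return "first" if a < b else "second"


def days_between(a: date, b: date) -> int:
    """Subtraction step: absolute gap between two dates, in days."""
    return abs((a - b).days)


def whole_years_between(a: date, b: date) -> int:
    """Subtraction step: number of completed years between two dates."""
    start, end = sorted((a, b))
    # Subtract one year if the anniversary of `start` has not yet
    # been reached in the year of `end`.
    not_reached = (end.month, end.day) < (start.month, start.day)
    return end.year - start.year - int(not_reached)


if __name__ == "__main__":
    # Hypothetical extracted dates for two entities in a comparison question.
    born_a = date(1879, 3, 14)
    born_b = date(1867, 11, 7)
    print(compare_dates(born_a, born_b))        # which entity is older
    print(whole_years_between(born_a, born_b))  # age gap in completed years
    print(days_between(date(2020, 1, 1), date(2020, 3, 1)))
```

Comparison needs only an ordering of the two dates, whereas subtraction also has to handle month lengths, leap years, and borrowing across year boundaries. This may help explain why a model can do well on the first task and still fail the second.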


Source journal

AACL Bioflux Environmental Science-Management, Monitoring, Policy and Law

CiteScore

1.40

Self-citation rate

0.00%

Number of publications