SESAME - self-supervised framework for extractive question answering over document collections

IF 2.3, Zone 3 (Computer Science), Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Journal of Intelligent Information Systems. Pub Date: 2024-07-30. DOI: 10.1007/s10844-024-00869-6

Vitor A. Batista, Diogo S. M. Gomes, Alexandre Evsukoff


Citations: 0

Abstract

Question answering is one of the most relevant areas of natural language processing, and it is evolving rapidly, with promising results driven by the increasing availability of suitable datasets and the advent of new technologies such as generative models. This article introduces SESAME, a Self-supervised framework for Extractive queStion Answering over docuMent collEctions. SESAME aims to enhance open-domain question answering (ODQA) systems by leveraging domain adaptation with synthetic datasets, enabling efficient question answering over private document collections with low resource usage. The framework incorporates recent advances in large language models and an efficient hybrid method for context retrieval. To demonstrate SESAME’s effectiveness, we conducted several sets of experiments with the Machine Reading for Question Answering (MRQA) 2019 Shared Task datasets, FAQuAD (a Brazilian Portuguese reading comprehension dataset), Wikipedia, and the Retrieval-Augmented Generation Benchmark. The results indicate that SESAME’s domain adaptation using synthetic data significantly improves QA performance, generalizes across different domains and languages, and competes with or surpasses state-of-the-art systems in ODQA. Finally, SESAME is an open-source tool, and all code, datasets, and experimental data are publicly available in our repository.
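The abstract reports gains in "QA performance" without naming the metrics. As a point of orientation only, extractive QA benchmarks in the SQuAD and MRQA family are commonly scored with exact match (EM) and token-level F1 between the predicted span and a gold answer. The sketch below is a minimal illustration of those two scores, under the assumption that the paper uses something similar; the function names are hypothetical, and the article-stripping step is English-specific, so it would need adapting for a Portuguese dataset such as FAQuAD.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-only assumption
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

When a question has several gold answers, the usual convention is to take the maximum score over them and then average across questions.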


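Source journal

Journal of Intelligent Information Systems (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore

7.20

Self-citation rate

11.80%

Articles published

Review time

6-12 weeks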

About the journal: The mission of the Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies is to foster and present research and development results focused on the integration of artificial intelligence and database technologies to create next-generation information systems: Intelligent Information Systems. These new information systems embody knowledge that allows them to exhibit intelligent behavior; cooperate with users and other systems in problem solving, discovery, access, retrieval, and manipulation of a wide variety of multimedia data and knowledge; and reason under uncertainty.

Increasingly, knowledge-directed inference processes are being used to discover knowledge from large data collections; provide cooperative support to users in complex query formulation and refinement; access, retrieve, store, and manage large collections of multimedia data and knowledge; integrate information from multiple heterogeneous data and knowledge sources; and reason about information under uncertain conditions. Multimedia and hypermedia information systems now operate on a global scale over the Internet, and new tools and techniques are needed to manage these dynamic and evolving information spaces.

The Journal of Intelligent Information Systems provides a forum in which academics, researchers, and practitioners may publish high-quality, original, state-of-the-art papers describing theoretical aspects, systems architectures, analysis and design tools and techniques, and implementation experiences in intelligent information systems. The categories of papers published by JIIS include research papers, invited papers, meetings, workshop and conference announcements and reports, survey and tutorial articles, and book reviews. Short articles describing open problems or their solutions are also welcome.