CONCORD: enhancing COVID-19 research with weak-supervision based numerical claim extraction

IF 3.4 · Zone 3 (Computer Science) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Journal of Intelligent Information Systems Pub Date : 2024-09-17 DOI:10.1007/s10844-024-00885-6

Dhwanil Shah, Krish Shah, Manan Jagani, Agam Shah, Bhaskar Chaudhury


Citations: 0

Abstract

The COVID-19 Numerical Claims Open Research Dataset (CONCORD) is a comprehensive, open-source dataset that extracts numerical claims from academic papers on COVID-19 research. A weak-supervision model is employed for this extraction, taking advantage of its white-box, explainable nature and reduced computational and annotation costs compared to transformer-based models. This model uses labelling functions such as pattern matching, external knowledge bases, phrase matching, and third-party models to generate labels, with an aggregator function handling contradictory labels. Evaluated against established baselines, the model achieved a weighted F1-score of 0.932 and a micro F1-score of 0.930. While transformer-based models achieve comparable results, the explainability of weak-supervision offers distinct advantages. Additionally, generative LLMs were tested to understand their effectiveness in extracting numerical claims, highlighting the impact of prompt engineering on performance. CONCORD contains approximately 200,000 numerical claims from over 57,000 COVID-19 research articles, serving as a valuable resource for tracking developments in COVID-19 research. This dataset, coupled with the weak-supervision approach, provides researchers with a significant tool for advancing COVID-19 research and showcases the potential of these methodologies in the broader biomedical field.
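The labelling-function-plus-aggregator idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cue phrases, the regular expression, the small term set standing in for an external knowledge base, and the majority-vote aggregator with a default on ties are all assumptions chosen for illustration. The abstract does not specify the actual patterns or aggregation rule, and third-party model labelling functions are omitted here.

```python
import re
from collections import Counter

# Label values; ABSTAIN means a labelling function has no opinion.
ABSTAIN, NOT_CLAIM, CLAIM = -1, 0, 1

# Hypothetical cues (assumptions, not taken from the paper).
CLAIM_PHRASES = ("was associated with", "increased by", "reduced by", "efficacy of")
KB_TERMS = {"mortality", "incidence", "odds ratio", "hazard ratio", "sensitivity"}
NUMBER_WITH_UNIT = re.compile(r"\d+(?:\.\d+)?\s?(?:%|percent|mg|days?|patients)")
REFERENCE_CUE = re.compile(r"\b(?:figure|table|et al)\b")


def lf_pattern(sentence):
    """Pattern matching: a number followed by a unit suggests a numerical claim."""
    return CLAIM if NUMBER_WITH_UNIT.search(sentence) else ABSTAIN


def lf_claim_phrase(sentence):
    """Phrase matching: cue phrases that often introduce a quantitative finding."""
    s = sentence.lower()
    return CLAIM if any(p in s for p in CLAIM_PHRASES) else ABSTAIN


def lf_reference_cue(sentence):
    """Pattern matching: pointers to figures, tables, or citations suggest no claim."""
    return NOT_CLAIM if REFERENCE_CUE.search(sentence.lower()) else ABSTAIN


def lf_knowledge_base(sentence):
    """Knowledge-base lookup, stubbed here as a small in-memory term set."""
    s = sentence.lower()
    return CLAIM if any(t in s for t in KB_TERMS) else ABSTAIN


LABELING_FUNCTIONS = [lf_pattern, lf_claim_phrase, lf_reference_cue, lf_knowledge_base]


def aggregate(votes, default=NOT_CLAIM):
    """Majority vote over non-abstaining votes; ties and all-abstain fall back to default."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return default
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def label_sentence(sentence):
    """Return the aggregated label together with the individual votes."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    return aggregate(votes), votes


if __name__ == "__main__":
    examples = [
        "Mortality was reduced by 23% in the treatment group.",
        "See Table 2 for details.",
        "Table 3 shows an incidence of 12 patients per 1000.",
    ]
    for text in examples:
        label, votes = label_sentence(text)
        print(label, votes, text)
```

Because each sentence's label comes with the list of individual votes, it is possible to see which rule fired and why. That is the kind of white-box explainability the abstract contrasts with transformer-based models.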


Source journal

Journal of Intelligent Information Systems (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore

7.20

Self-citation rate

11.80%

Articles published

Review time

6-12 weeks

About the journal: The mission of the Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies is to foster and present research and development results focused on the integration of artificial intelligence and database technologies to create next-generation information systems - Intelligent Information Systems. These new information systems embody knowledge that allows them to exhibit intelligent behavior, cooperate with users and other systems in problem solving, discovery, access, retrieval and manipulation of a wide variety of multimedia data and knowledge, and reason under uncertainty. Increasingly, knowledge-directed inference processes are being used to: discover knowledge from large data collections, provide cooperative support to users in complex query formulation and refinement, access, retrieve, store and manage large collections of multimedia data and knowledge, integrate information from multiple heterogeneous data and knowledge sources, and reason about information under uncertain conditions. Multimedia and hypermedia information systems now operate on a global scale over the Internet, and new tools and techniques are needed to manage these dynamic and evolving information spaces. The Journal of Intelligent Information Systems provides a forum wherein academics, researchers and practitioners may publish high-quality, original and state-of-the-art papers describing theoretical aspects, systems architectures, analysis and design tools and techniques, and implementation experiences in intelligent information systems. The categories of papers published by JIIS include: research papers, invited papers, meetings, workshop and conference announcements and reports, survey and tutorial articles, and book reviews. Short articles describing open problems or their solutions are also welcome.