Evaluating data extraction error by a large language model from randomised controlled trials: a large-scale empirical study

Shiqi Fan, Ming Chen, Suhail A Doi, Zhangnan Ye, Zhen Peng, Yuan Tian, Chengzhi Zhang, Luis Furuya-Kanamori, Lifeng Lin, Evan Mayo-Wilson, Mohammad Hassan Murad, Xiuhong Meng, Chang Xu

BMJ Evidence-Based Medicine, published 2026-04-29. DOI: 10.1136/bmjebm-2025-114044
Abstract
Objective: To examine the potential errors of a general-purpose large language model (LLM; Claude 3.5 Sonnet) in data extraction from randomised controlled trials (RCTs).
Design and setting: An empirical study comparing Claude 3.5 Sonnet extractions against a human-performed verification dataset. The extraction tasks for Claude 3.5 Sonnet were based solely on original RCT portable document format (PDF) files. For PDFs that could not be directly extracted by Claude 3.5 Sonnet, optical character recognition was employed to convert them into text format before extraction.
Participants: A random sample of 664 trials was selected from a well-established trial bank and a final data pool was established based on rigorous manual cross-checking as a reference standard.
Data sources: PubMed, EMBASE, Scopus, Web of Science (all databases) and the Cochrane Central Register of Controlled Trials (CENTRAL) up to February 2023.
Eligibility criteria for selecting studies: RCTs on children involving medication and adverse events.
Main outcome measures: Claude 3.5 Sonnet was applied to extract basic information (eg, trial design, population information and source of funding) and adverse outcomes (ie, name of adverse events, number of events). Claude 3.5 Sonnet outputs were compared against the final data pool and all errors were recorded. Results are presented as error rates with 95% CIs, estimated using a generalised linear mixed model.
Results: For the 664 trials, a total of 23 069 data cells were extracted via Claude 3.5 Sonnet, with 10 624 for basic information and 12 445 for adverse outcomes. The overall error rate for data extraction was 6.6% (95% CI 5.4% to 8.2%), with 5.7% (95% CI 5.2% to 6.1%) in basic information and 7.6% (95% CI 4.9% to 11.8%) in adverse outcomes. When the 1542 total errors were stratified by type, misallocation (assigning data to incorrect fields; 57.1%, 881/1542) and missed or omitted data (incomplete extraction of available data; 23.2%, 357/1542) were the two most frequent errors, with misallocation occurring more in basic information (53.3%, 470/881), while missed or omitted data occurred more in adverse outcomes (96.1%, 343/357). Post hoc analysis examining the association between trial reporting quality (assessed using the Consolidated Standards of Reporting Trials (CONSORT) 2025) and LLM data extraction error rates indicated that higher CONSORT adherence was associated with lower extraction error rates.
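As a back-of-the-envelope check on the reported counts, the sketch below computes the crude error rate (1542 errors among 23 069 extracted cells) with a Wilson score 95% CI in Python. This is an illustration only, not the study's method: the naive interval treats every cell as independent, whereas the paper's generalised linear mixed model accounts for clustering of cells within trials, which is why the reported interval (5.4% to 8.2%) is wider than the crude one.

```python
from math import sqrt

def wilson_ci(errors: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    Returns (point estimate, lower bound, upper bound).
    Note: ignores clustering of cells within trials, so it
    understates uncertainty relative to the paper's GLMM.
    """
    p = errors / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return p, centre - half, centre + half

# Counts reported in the abstract: 1542 errors out of 23 069 data cells
p, lo, hi = wilson_ci(1542, 23069)
print(f"crude error rate: {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The crude estimate lands near the paper's model-based 6.6%, but with a visibly tighter interval, illustrating why a clustering-aware model matters for cell-level error data.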
Conclusions: The data extraction error rate of Claude 3.5 Sonnet was relatively low, but it serves as a caution for LLM applications in evidence synthesis. Detailed checking of LLM outputs should be the primary consideration for evidence synthesisers.
Journal description:
BMJ Evidence-Based Medicine (BMJ EBM) publishes original evidence-based research, insights and opinions on what matters for health care. We focus on the tools, methods, and concepts that are basic and central to practising evidence-based medicine and deliver relevant, trustworthy and impactful evidence.
BMJ EBM is a Plan S compliant Transformative Journal and adheres to the highest possible industry standards for editorial policies and publication ethics.