Evaluating data extraction error by a large language model from randomised controlled trials: a large-scale empirical study

Shiqi Fan, Ming Chen, Suhail A Doi, Zhangnan Ye, Zhen Peng, Yuan Tian, Chengzhi Zhang, Luis Furuya-Kanamori, Lifeng Lin, Evan Mayo-Wilson, Mohammad Hassan Murad, Xiuhong Meng, Chang Xu

BMJ Evidence-Based Medicine, published 2026-04-29. DOI: 10.1136/bmjebm-2025-114044
Abstract
Objective: To examine the potential errors of a general-purpose large language model (LLM; Claude 3.5 Sonnet) in data extraction from randomised controlled trials (RCTs).
Design and setting: An empirical study comparing Claude 3.5 Sonnet extractions against a human-performed verification dataset. The extraction tasks for Claude 3.5 Sonnet were based solely on original RCT portable document format (PDF) files. For PDFs that could not be directly extracted by Claude 3.5 Sonnet, optical character recognition was employed to convert them into text format before extraction.
Participants: A random sample of 664 trials was selected from a well-established trial bank and a final data pool was established based on rigorous manual cross-checking as a reference standard.
Data sources: PubMed, EMBASE, Scopus, Web of Science (all databases) and the Cochrane Central Register of Controlled Trials (CENTRAL) up to February 2023.
Eligibility criteria for selecting studies: RCTs on children involving medication and adverse events.
Main outcome measures: Claude 3.5 Sonnet was applied to extract basic information (eg, trial design, population information and source of funding) and adverse outcomes (ie, name of adverse events, number of events). Claude 3.5 Sonnet outputs were compared against the final data pool and all errors were recorded. Results are presented as error rates with 95% CIs, estimated using a generalised linear mixed model.
Results: For the 664 trials, a total of 23 069 data cells were extracted via Claude 3.5 Sonnet, with 10 624 for basic information and 12 445 for adverse outcomes. The overall error rate for data extraction was 6.6% (95% CI 5.4% to 8.2%), with 5.7% (95% CI 5.2% to 6.1%) in basic information and 7.6% (95% CI 4.9% to 11.8%) in adverse outcomes. When the 1542 total errors were stratified by type, misallocation (assigning data to incorrect fields; 57.1%, 881/1542) and missed or omitted data (incomplete extraction of available data; 23.2%, 357/1542) were the two most frequent errors, with misallocation occurring more in basic information (53.3%, 470/881), while missed or omitted data occurred more in adverse outcomes (96.1%, 343/357). Post hoc analysis examining the association between trial reporting quality (assessed using the Consolidated Standards of Reporting Trials (CONSORT) 2025) and LLM data extraction error rates indicated that higher CONSORT adherence was associated with lower extraction error rates.
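As a back-of-the-envelope check on the reported counts, the sketch below computes the crude error rate (1542 errors among 23 069 extracted cells) with a Wilson score 95% CI in Python. This is an illustration only, not the study's method: the naive interval treats every cell as independent, whereas the paper's generalised linear mixed model accounts for clustering of cells within trials, which is why the reported interval (5.4% to 8.2%) is wider than the crude one.

```python
from math import sqrt

def wilson_ci(errors: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    Returns (point estimate, lower bound, upper bound).
    Note: ignores clustering of cells within trials, so it
    understates uncertainty relative to the paper's GLMM.
    """
    p = errors / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return p, centre - half, centre + half

# Counts reported in the abstract: 1542 errors out of 23 069 data cells
p, lo, hi = wilson_ci(1542, 23069)
print(f"crude error rate: {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The crude estimate lands near the paper's model-based 6.6%, but with a visibly tighter interval, illustrating why a clustering-aware model matters for cell-level error data.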
Conclusions: The data extraction error rate of Claude 3.5 Sonnet was relatively low, but it serves as a caution for LLM applications in evidence synthesis. Detailed checking of LLM outputs should be the primary consideration for evidence synthesisers.
Journal description:
BMJ Evidence-Based Medicine (BMJ EBM) publishes original evidence-based research, insights and opinions on what matters for health care. We focus on the tools, methods, and concepts that are basic and central to practising evidence-based medicine and deliver relevant, trustworthy and impactful evidence.
BMJ EBM is a Plan S compliant Transformative Journal and adheres to the highest possible industry standards for editorial policies and publication ethics.