Using Artificial Intelligence Tools as Second Reviewers for Data Extraction in Systematic Reviews: A Performance Comparison of Two AI Tools Against Human Reviewers
T. Helms Andersen, T. M. Marcussen, A. D. Termannsen, T. W. H. Lawaetz, O. Nørgaard
{"title":"Using Artificial Intelligence Tools as Second Reviewers for Data Extraction in Systematic Reviews: A Performance Comparison of Two AI Tools Against Human Reviewers","authors":"T. Helms Andersen, T. M. Marcussen, A. D. Termannsen, T. W. H. Lawaetz, O. Nørgaard","doi":"10.1002/cesm.70036","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Background</h3>\n \n <p>Systematic reviews are essential but time-consuming and expensive. Large language models (LLMs) and artificial intelligence (AI) tools could potentially automate data extraction, but no comprehensive workflow has been tested for different review types.</p>\n </section>\n \n <section>\n \n <h3> Objective</h3>\n \n <p>To evaluate Elicit's and ChatGPT's abilities to extract data from journal articles as a replacement for one of two human data extractors in systematic reviews.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Human-extracted data from three systematic reviews (30 articles in total) was compared to data extracted by Elicit and ChatGPT. The AI tools extracted population characteristics, study design, and review-specific variables. Performance metrics were calculated against human double-extracted data as the gold standard, followed by a detailed error analysis.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>Precision, recall and F1-score were all 92% for Elicit and 91%, 89% and 90% for ChatGPT. Recall was highest for study design (Elicit: 100%; ChatGPT: 90%) and population characteristics (Elicit: 100%; ChatGPT: 97%), while review-specific variables achieved 77% in Elicit and 80% in ChatGPT. Elicit had four instances of confabulation while ChatGPT had three. There was no significant difference between the two AI tools' performance (recall difference: 3.3% points, 95% CI: –5.2%–11.9%, <i>p</i> = 0.445).</p>\n </section>\n \n <section>\n \n <h3> Conclusion</h3>\n \n <p>AI tools demonstrated high and similar performance in data extraction compared to human reviewers, particularly for standardized variables. Error analysis revealed confabulations in 4% of data points. We propose adopting AI-assisted extraction to replace the second human extractor, with the second human instead focusing on reconciling discrepancies between AI and the primary human extractor.</p>\n </section>\n </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70036","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cochrane Evidence Synthesis and Methods","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cesm.70036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Background
Systematic reviews are essential but time-consuming and expensive. Large language models (LLMs) and artificial intelligence (AI) tools could potentially automate data extraction, but no comprehensive workflow has been tested for different review types.
Objective
To evaluate Elicit's and ChatGPT's abilities to extract data from journal articles as a replacement for one of two human data extractors in systematic reviews.
Methods
Human-extracted data from three systematic reviews (30 articles in total) were compared to data extracted by Elicit and ChatGPT. The AI tools extracted population characteristics, study design, and review-specific variables. Performance metrics were calculated against human double-extracted data as the gold standard, followed by a detailed error analysis.
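For orientation, the sketch below shows one way per-variable extraction results could be scored against a human gold standard. The field names, exact-match rule, and data layout are illustrative assumptions, not the authors' protocol.

# Illustrative scoring sketch (not the authors' actual pipeline): each extracted
# data point is compared with the human double-extracted gold standard, and
# precision, recall, and F1 are computed from exact matches.

def score_extraction(ai_items: dict, gold_items: dict) -> dict:
    """ai_items/gold_items map (article_id, variable) -> extracted value."""
    true_pos = sum(1 for key, value in ai_items.items()
                   if key in gold_items and value == gold_items[key])
    false_pos = len(ai_items) - true_pos  # extracted but spurious or mismatched
    false_neg = sum(1 for key in gold_items
                    if key not in ai_items or ai_items[key] != gold_items[key])  # missed or mismatched

    precision = true_pos / (true_pos + false_pos) if ai_items else 0.0
    recall = true_pos / (true_pos + false_neg) if gold_items else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: two variables extracted correctly, one missed, one wrong.
gold = {("study1", "n_participants"): "120", ("study1", "design"): "RCT",
        ("study2", "n_participants"): "85", ("study2", "design"): "cohort"}
ai = {("study1", "n_participants"): "120", ("study1", "design"): "RCT",
      ("study2", "design"): "case-control"}
print(score_extraction(ai, gold))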
Results
Precision, recall, and F1-score were all 92% for Elicit, and 91%, 89%, and 90%, respectively, for ChatGPT. Recall was highest for study design (Elicit: 100%; ChatGPT: 90%) and population characteristics (Elicit: 100%; ChatGPT: 97%), whereas recall for review-specific variables was 77% for Elicit and 80% for ChatGPT. Elicit had four instances of confabulation and ChatGPT had three. There was no significant difference in performance between the two AI tools (recall difference: 3.3 percentage points; 95% CI: −5.2% to 11.9%; p = 0.445).
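The abstract does not state which statistical procedure produced the quoted confidence interval and p-value. As a hedged illustration only, a paired bootstrap over extracted data points is one way such a recall difference could be compared; the function and simulated data below are hypothetical.

# Hypothetical paired bootstrap for a recall difference between two tools
# (illustrative only; the authors' exact statistical test is not stated here).
import random

def bootstrap_recall_diff(hits_a, hits_b, n_boot=10_000, seed=0):
    """hits_a/hits_b: lists of 0/1 per gold-standard data point (1 = recalled)."""
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample the same data points for both tools
        recall_a = sum(hits_a[i] for i in idx) / n
        recall_b = sum(hits_b[i] for i in idx) / n
        diffs.append(recall_a - recall_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]  # 95% percentile CI

# Toy example with 200 simulated data points per tool.
rng = random.Random(1)
hits_elicit = [1 if rng.random() < 0.92 else 0 for _ in range(200)]
hits_chatgpt = [1 if rng.random() < 0.89 else 0 for _ in range(200)]
print(bootstrap_recall_diff(hits_elicit, hits_chatgpt))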
Conclusion
The AI tools demonstrated high performance in data extraction, similar to that of human reviewers, particularly for standardized variables. Error analysis revealed confabulations in 4% of data points. We propose adopting AI-assisted extraction to replace the second human extractor, with the second human instead focusing on reconciling discrepancies between the AI and the primary human extractor.