Utilizing large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation.
Xiangming Cai, Yuanming Geng, Yiming Du, Bart Westerman, Duolao Wang, Chiyuan Ma, Juan J Garcia Vallejo
{"title":"Utilizing Large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation.","authors":"Xiangming Cai, Yuanming Geng, Yiming Du, Bart Westerman, Duolao Wang, Chiyuan Ma, Juan J Garcia Vallejo","doi":"10.1186/s12874-025-02569-3","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) like ChatGPT showed great potential in aiding medical research. A heavy workload in filtering records is needed during the research process of evidence-based medicine, especially meta-analysis. However, few studies tried to use LLMs to help screen records in meta-analysis.</p><p><strong>Objective: </strong>In this research, we aimed to explore the possibility of incorporating multiple LLMs to facilitate the screening step based on the title and abstract of records during meta-analysis.</p><p><strong>Methods: </strong>Various LLMs were evaluated, which includes GPT-3.5, GPT-4, Deepseek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2 and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in the study, as additional validation. For the automatic selection of records from curated meta-analyses, a four-step strategy called LARS-GPT was developed, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) best combination identification, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.</p><p><strong>Results: </strong>A variable performance was found between different single-prompts, with a mean recall of 0.800. Based on these single-prompts, we were able to find combinations with better performance than the pre-set threshold. Finally, with a best combination of criteria identified, LARS-GPT showed a 40.1% workload reduction on average with a recall greater than 0.9.</p><p><strong>Conclusions: </strong>We show here the groundbreaking finding that automatic selection of literature for meta-analysis is possible with LLMs. We provide it here as a pipeline, LARS-GPT, which showed a great workload reduction while maintaining a pre-set recall.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"116"},"PeriodicalIF":3.9000,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12036192/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Research Methodology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12874-025-02569-3","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Large language models (LLMs) such as ChatGPT have shown great potential for aiding medical research. Evidence-based medicine, and meta-analysis in particular, requires a heavy workload of record filtering. However, few studies have tried to use LLMs to help screen records for meta-analysis.
Objective: In this research, we aimed to explore whether multiple LLMs can facilitate the screening step of meta-analysis based on the titles and abstracts of records.
Methods: Various LLMs were evaluated, including GPT-3.5, GPT-4, DeepSeek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2, and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in this study as additional validation. For the automatic selection of records from these curated meta-analyses, we developed a four-step strategy called LARS-GPT, consisting of (1) criteria selection and single-prompt creation (a prompt with one criterion), (2) identification of the best criterion combination, (3) combined-prompt creation (a prompt with one or more criteria), and (4) request sending and answer summarization. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.
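To make the four-step strategy concrete, the following is a minimal sketch of what such a screening loop might look like. The paper does not publish this exact code; `ask_llm`, the prompt wording, the record format, and the logical-AND combination rule are assumptions for illustration only.

```python
# Minimal sketch of a LARS-GPT-style screening loop. Illustrative only:
# ask_llm(), the prompt wording, and the record format are assumptions,
# not the authors' published implementation.

def ask_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API (GPT-4, Llama-3.1, Qwen-2.5, ...).
    Replace the body with a real API call; here it always answers YES."""
    return "YES"

def single_prompt(criterion: str, title: str, abstract: str) -> str:
    # Step 1: a single-prompt carries exactly one selection criterion.
    return (
        f"Criterion: {criterion}\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        "Does this record satisfy the criterion? Answer YES or NO."
    )

def screen_record(record: dict, criteria: list[str]) -> bool:
    """Steps 3-4: query the model per criterion and combine the answers.
    One plausible combination rule (logical AND) is shown; the paper's
    combined-prompt may instead pack several criteria into one request."""
    for criterion in criteria:
        reply = ask_llm(single_prompt(criterion, record["title"], record["abstract"]))
        if "YES" not in reply.upper():
            return False  # excluded: the record fails this criterion
    return True  # retained: passes every criterion, proceed to full-text review

# Step 2 (best-combination identification) would score candidate criterion
# subsets on a labelled pilot set and keep the combination whose recall
# stays above the pre-set threshold (0.9 in the paper).
record = {"title": "A glioma cohort study", "abstract": "We followed 200 patients..."}
print(screen_record(record, ["reports a glioma patient cohort", "is a clinical study"]))
```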
Results: Performance varied across the different single-prompts, with a mean recall of 0.800. Building on these single-prompts, we identified criterion combinations whose performance exceeded the pre-set threshold. With the best combination of criteria, LARS-GPT achieved an average workload reduction of 40.1% while maintaining a recall greater than 0.9.
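For reference, the four reported metrics follow from a standard confusion matrix over screening decisions, with "include" as the positive class. A minimal sketch is below; the exact definition of workload reduction is assumed here to be the share of records the model excludes, and the example counts are invented solely to reproduce the headline figures (recall 0.9, 40.1% reduction).

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, precision, F1, and workload reduction from screening counts.
    Assumption: workload reduction = share of records auto-excluded by the
    model (those a human reviewer no longer needs to read)."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)                 # fraction of true includes retained
    precision = tp / (tp + fp)              # fraction of flagged includes that are correct
    f1 = 2 * precision * recall / (precision + recall)
    workload_reduction = (tn + fn) / total  # records excluded automatically
    return {"recall": recall, "precision": precision,
            "F1": f1, "workload_reduction": workload_reduction}

# Invented counts chosen to match the headline numbers:
# 90 of 100 relevant records kept (recall 0.9), 401 of 1000 auto-excluded (40.1%).
print(screening_metrics(tp=90, fp=509, tn=391, fn=10))
```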
Conclusions: We show that automatic selection of literature for meta-analysis is possible with LLMs. We provide the approach as a pipeline, LARS-GPT, which delivers a substantial workload reduction while maintaining a pre-set recall.
About the journal:
BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.