Evaluating the effectiveness of large language models in abstract screening: a comparative analysis.

IF 6.3; Zone 4 (Medicine); Q1, MEDICINE, GENERAL & INTERNAL

Systematic Reviews Pub Date : 2024-08-21 DOI:10.1186/s13643-024-02609-x

Michael Li, Jianping Sun, Xianming Tan

Systematic Reviews, volume 13, issue 1, article 219. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11337893/pdf/

Citations: 0

Abstract

Objective: This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.

Methods: We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, as well as more recent tools, including ChatGPT v4.0 turbo, ChatGPT v3.5 turbo, Google Gemini 1.0 pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared to human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.
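
The three metrics named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' script: the function name `screening_metrics`, the `ScreeningMetrics` container, and the toy decisions are all hypothetical. The sketch assumes each abstract has one human decision and one LLM decision, encoded as a boolean where True means "include".

```python
import math
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    sensitivity: float  # share of human-included abstracts the LLM also included
    specificity: float  # share of human-excluded abstracts the LLM also excluded
    accuracy: float     # share of all abstracts where the LLM agreed with the human


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a class is absent from the data.
    return numerator / denominator if denominator else math.nan


def screening_metrics(human, llm) -> ScreeningMetrics:
    """Compare LLM decisions against human decisions (hypothetical helper).

    Both arguments are equal-length sequences of truthy/falsy values,
    where a truthy value means "include the abstract".
    """
    if len(human) != len(llm):
        raise ValueError("human and llm must have the same length")

    pairs = [(bool(h), bool(m)) for h, m in zip(human, llm)]
    tp = sum(h and m for h, m in pairs)              # both include
    tn = sum((not h) and (not m) for h, m in pairs)  # both exclude
    fp = sum((not h) and m for h, m in pairs)        # LLM includes, human excludes
    fn = sum(h and (not m) for h, m in pairs)        # LLM excludes, human includes

    return ScreeningMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        accuracy=_ratio(tp + tn, len(pairs)),
    )


# Toy data for illustration only; not taken from the study.
human_decisions = [True, True, False, False, False]
llm_decisions = [True, False, False, False, True]
metrics = screening_metrics(human_decisions, llm_decisions)
print(metrics)
```

With the toy data above there is one true positive, one false negative, two true negatives, and one false positive, so sensitivity is 1/2, specificity is 2/3, and accuracy is 3/5. Sensitivity is usually the metric to watch in screening, because a missed relevant abstract cannot be recovered at later review stages.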

Results: Different LLM tools had varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity, and overall accuracy consistently reaching or exceeding 90%, indicating a high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.

Conclusion: While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and integrate with hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.


Source journal

Systematic Reviews Medicine-Medicine (miscellaneous)

CiteScore: 8.30
Self-citation rate: 0.00%
Articles published: 241
Review time: 11 weeks

About the journal: Systematic Reviews encompasses all aspects of the design, conduct and reporting of systematic reviews. The journal publishes high quality systematic review products including systematic review protocols, systematic reviews related to a very broad definition of health, rapid reviews, updates of already completed systematic reviews, and methods research related to the science of systematic reviews, such as decision modelling. At this time Systematic Reviews does not accept reviews of in vitro studies. The journal also aims to ensure that the results of all well-conducted systematic reviews are published, regardless of their outcome.