High-performance automated abstract screening with large language model ensembles.

IF 4.7 2区 医学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Rohan Sanghera, Arun James Thirunavukarasu, Marc El Khoury, Jessica O'Logbon, Yuqing Chen, Archie Watt, Mustafa Mahmood, Hamid Butt, George Nishimura, Andrew A S Soltan
{"title":"High-performance automated abstract screening with large language model ensembles.","authors":"Rohan Sanghera, Arun James Thirunavukarasu, Marc El Khoury, Jessica O'Logbon, Yuqing Chen, Archie Watt, Mustafa Mahmood, Hamid Butt, George Nishimura, Andrew A S Soltan","doi":"10.1093/jamia/ocaf050","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>screening is a labor-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.</p><p><strong>Materials and methods: </strong>LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).</p><p><strong>Results: </strong>On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, the best performing LLM-prompt combinations exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity with a maximal precision of 0.458 with the development dataset, decreasing to 0.1450 over the comprehensive dataset; but conferring workload reductions ranging between 37.55% and 99.11%.</p><p><strong>Discussion: </strong>Automated abstract screening can reduce the screening workload in systematic review while maintaining quality. Performance variation between reviews highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records.</p><p><strong>Conclusion: </strong>LLMs may reduce the human labor cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7000,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocaf050","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Objective: screening is a labor-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.

Materials and methods: LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).

Results: On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, the best performing LLM-prompt combinations exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity with a maximal precision of 0.458 with the development dataset, decreasing to 0.1450 over the comprehensive dataset; but conferring workload reductions ranging between 37.55% and 99.11%.

Discussion: Automated abstract screening can reduce the screening workload in systematic review while maintaining quality. Performance variation between reviews highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records.

Conclusion: LLMs may reduce the human labor cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.

具有大型语言模型集成的高性能自动抽象筛选。
目的:筛选是系统评价的劳动密集型组成部分,涉及对大量研究重复应用纳入和排除标准。我们的目标是验证用于自动化抽象筛选的大型语言模型(llm)。材料和方法:在23篇Cochrane图书馆系统综述中对LLMs (GPT-3.5 Turbo、GPT-4 Turbo、gpt - 40、Llama 370b、Gemini 1.5 Pro和Claude Sonnet 3.5)进行试验,以评估它们在零概率二元分类中进行摘要筛选的准确性。在平衡发展数据集(n = 800)上的初步评估确定了最佳提示策略,然后在复制搜索结果的综合数据集(n = 119 695)上验证了表现最佳的llm提示组合。结果:在开发数据集上,LLMmax在灵敏度(LLMmax = 1.000, humanmax = 0.775)、精度(LLMmax = 0.927, humanmax = 0.911)和平衡精度(LLMmax = 0.904, humanmax = 0.865)方面均优于人类研究人员。当在综合数据集上进行评估时,表现最好的llm提示组合表现出一致的灵敏度(范围0.756-1.000),但由于类别不平衡而降低了精度(范围0.004-0.096)。此外,66个LLM-human和LLM-LLM - llm组合在开发数据集上的最大精度为0.458,在综合数据集上的最大精度降至0.1450;但工作量减少幅度在37.55%到99.11%之间。讨论:自动化摘要筛选可以在保证质量的同时减少系统评审中的筛选工作量。审查之间的性能差异突出了在自主部署之前特定领域验证的重要性。LLM-human合奏可以实现类似的好处,同时保持人类对所有记录的监督。结论:llm可以降低系统评价的人力成本,保持或提高系统评价的准确性,从而提高证据合成的效率和质量。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Journal of the American Medical Informatics Association
Journal of the American Medical Informatics Association 医学-计算机:跨学科应用
CiteScore
14.50
自引率
7.80%
发文量
230
审稿时长
3-8 weeks
期刊介绍: JAMIA is AMIA''s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA''s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信