Batch Size Effects on Mid-2025 State-of-the-Art Large Language Model Performance in Automated Title and Abstract Screening

Petter Fagerberg, Oscar Sallander, Kim Vikhe Patil, Anders Berg, Anastasia Nyman, Natalia Borg, Thomas Lindén

Cochrane Evidence Synthesis and Methods, 4(3). DOI: 10.1002/cesm.70082
Abstract
Background
Manual abstract screening is a primary bottleneck in evidence synthesis. Emerging evidence suggests that large language models (LLMs) can automate this task, but their performance when processing multiple references simultaneously in “batches” is uncertain.
Objectives
To evaluate the classification performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, and GPT-5 mini) in predicting reference eligibility across a wide range of batch sizes for a systematic review of randomized controlled trials.
Methods
We used a gold-standard dataset of 790 references (93 considered relevant) from a published Cochrane Review on stem cell treatment for acute myocardial infarction. Using each model's public API, we submitted batches of 1 to 790 references and asked the model to classify each reference as “Include” or “Exclude.” Performance was assessed using sensitivity and specificity, with internal validation through 10 repeated runs for each model–batch-size combination.
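A minimal sketch of the batched workflow described here: pack a batch of title/abstract records into one prompt, request an “Include”/“Exclude” decision per reference, and match the labelled replies back to their IDs. The `call_llm` callable, the prompt wording, and the one-decision-per-line reply format are assumptions for illustration; the study's actual prompts and parsing rules are not reproduced.

```python
# Hypothetical sketch of batched title/abstract screening via an LLM API.
# `call_llm` stands in for the actual API call of whichever model is used.
from typing import Callable

def screen_in_batches(
    references: list[dict],          # each: {"id": ..., "title": ..., "abstract": ...}
    batch_size: int,
    call_llm: Callable[[str], str],  # sends one prompt, returns the raw model reply
) -> dict[str, str]:
    """Classify every reference as 'Include' or 'Exclude', batch_size at a time."""
    decisions: dict[str, str] = {}
    for start in range(0, len(references), batch_size):
        batch = references[start:start + batch_size]
        # One prompt carries the whole batch; the model must return one
        # labelled decision per reference so replies can be matched back.
        prompt = (
            "For each reference below, answer 'Include' or 'Exclude' "
            "based on the review's eligibility criteria.\n\n"
            + "\n\n".join(
                f"[{r['id']}] {r['title']}\n{r['abstract']}" for r in batch
            )
        )
        reply = call_llm(prompt)
        for line in reply.splitlines():
            # Assumed reply format: "<id>: Include|Exclude" (one line each).
            if ":" in line:
                ref_id, _, decision = line.partition(":")
                decisions[ref_id.strip(" []")] = decision.strip()
    return decisions
```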
Results
Gemini 2.5 Pro was the most robust model, successfully processing the full 790-reference batch. In contrast, GPT-5 failed at batch sizes of 400 and above, while GPT-5 mini and Gemini 2.5 Flash failed at the full 790-reference batch. Overall, all models demonstrated strong performance within their operational ranges, with two notable exceptions: Gemini 2.5 Flash showed low sensitivity at a batch size of 1, and GPT-5 mini's sensitivity degraded at higher batch sizes (from 0.88 at batch size 200 to 0.48 at batch size 400). At a practical batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00, 95% CI 1.00–1.00), whereas GPT-5 delivered the highest specificity (0.98, 95% CI 0.98–0.98).
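To make the reported metrics concrete: sensitivity is the share of truly relevant references the model labels “Include” (TP/(TP+FN)), and specificity is the share of irrelevant references it labels “Exclude” (TN/(TN+FP)). The sketch below computes both from a gold-standard map and one run's predictions; the data format, helper name, and worked numbers are illustrative, not taken from the study's code.

```python
# Illustrative sensitivity/specificity computation (not the authors' code).
# `gold` and `predicted` map reference IDs to "Include" or "Exclude";
# a missing prediction is treated as "Exclude".
def sensitivity_specificity(gold: dict[str, str],
                            predicted: dict[str, str]) -> tuple[float, float]:
    tp = fn = tn = fp = 0
    for ref_id, truth in gold.items():
        decision = predicted.get(ref_id, "Exclude")
        if truth == "Include":
            if decision == "Include":
                tp += 1   # relevant reference correctly included
            else:
                fn += 1   # relevant reference missed
        else:
            if decision == "Exclude":
                tn += 1   # irrelevant reference correctly excluded
            else:
                fp += 1   # irrelevant reference wrongly included
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical example with the study's class balance (93 relevant of 790):
# a run that includes 90 of the 93 relevant references and wrongly includes
# 14 of the 697 irrelevant ones yields sensitivity = 90/93 ≈ 0.97 and
# specificity = 683/697 ≈ 0.98.
```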
Conclusion
State-of-the-art LLMs can effectively screen multiple abstracts per prompt, moving beyond inefficient single-reference processing. However, performance is model-dependent, with trade-offs between sensitivity and specificity. Batch size optimization and strategic model selection are therefore important considerations for successful implementation.