Batch Size Effects on Mid-2025 State-of-the-Art Large Language Model Performance in Automated Title and Abstract Screening

Petter Fagerberg, Oscar Sallander, Kim Vikhe Patil, Anders Berg, Anastasia Nyman, Natalia Borg, Thomas Lindén

Cochrane Evidence Synthesis and Methods, 4(3). DOI: 10.1002/cesm.70082
Abstract
Background
Manual abstract screening is a primary bottleneck in evidence synthesis. Emerging evidence suggests that large language models (LLMs) can automate this task, but their performance when processing multiple references simultaneously in “batches” is uncertain.
Objectives
To evaluate the classification performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, and GPT-5 mini) in predicting reference eligibility across a wide range of batch sizes for a systematic review of randomized controlled trials.
Methods
We used a gold-standard dataset of 790 references (93 considered relevant) from a published Cochrane Review on stem cell treatment for acute myocardial infarction. Using each model's public API, we submitted batches of 1 to 790 references and asked the model to classify each reference as “Include” or “Exclude.” Performance was assessed using sensitivity and specificity, with internal validation through 10 repeated runs for each model–batch-size combination.
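A minimal sketch of the batched workflow described here: pack a batch of title/abstract records into one prompt, request an “Include”/“Exclude” decision per reference, and match the labelled replies back to their IDs. The `call_llm` callable, the prompt wording, and the one-decision-per-line reply format are assumptions for illustration; the study's actual prompts and parsing rules are not reproduced.

```python
# Hypothetical sketch of batched title/abstract screening via an LLM API.
# `call_llm` stands in for the actual API call of whichever model is used.
from typing import Callable

def screen_in_batches(
    references: list[dict],          # each: {"id": ..., "title": ..., "abstract": ...}
    batch_size: int,
    call_llm: Callable[[str], str],  # sends one prompt, returns the raw model reply
) -> dict[str, str]:
    """Classify every reference as 'Include' or 'Exclude', batch_size at a time."""
    decisions: dict[str, str] = {}
    for start in range(0, len(references), batch_size):
        batch = references[start:start + batch_size]
        # One prompt carries the whole batch; the model must return one
        # labelled decision per reference so replies can be matched back.
        prompt = (
            "For each reference below, answer 'Include' or 'Exclude' "
            "based on the review's eligibility criteria.\n\n"
            + "\n\n".join(
                f"[{r['id']}] {r['title']}\n{r['abstract']}" for r in batch
            )
        )
        reply = call_llm(prompt)
        for line in reply.splitlines():
            # Assumed reply format: "<id>: Include|Exclude" (one line each).
            if ":" in line:
                ref_id, _, decision = line.partition(":")
                decisions[ref_id.strip(" []")] = decision.strip()
    return decisions
```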
Results
Gemini 2.5 Pro was the most robust model, successfully processing the full 790-reference batch. In contrast, GPT-5 failed at batch sizes of 400 and above, while GPT-5 mini and Gemini 2.5 Flash failed at the full 790-reference batch. Overall, all models demonstrated strong performance within their operational ranges, with two notable exceptions: Gemini 2.5 Flash showed low sensitivity at a batch size of 1, and GPT-5 mini's sensitivity degraded at higher batch sizes (from 0.88 at batch size 200 to 0.48 at batch size 400). At a practical batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00, 95% CI 1.00–1.00), whereas GPT-5 delivered the highest specificity (0.98, 95% CI 0.98–0.98).
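To make the reported metrics concrete: sensitivity is the share of truly relevant references the model labels “Include” (TP/(TP+FN)), and specificity is the share of irrelevant references it labels “Exclude” (TN/(TN+FP)). The sketch below computes both from a gold-standard map and one run's predictions; the data format, helper name, and worked numbers are illustrative, not taken from the study's code.

```python
# Illustrative sensitivity/specificity computation (not the authors' code).
# `gold` and `predicted` map reference IDs to "Include" or "Exclude";
# a missing prediction is treated as "Exclude".
def sensitivity_specificity(gold: dict[str, str],
                            predicted: dict[str, str]) -> tuple[float, float]:
    tp = fn = tn = fp = 0
    for ref_id, truth in gold.items():
        decision = predicted.get(ref_id, "Exclude")
        if truth == "Include":
            if decision == "Include":
                tp += 1   # relevant reference correctly included
            else:
                fn += 1   # relevant reference missed
        else:
            if decision == "Exclude":
                tn += 1   # irrelevant reference correctly excluded
            else:
                fp += 1   # irrelevant reference wrongly included
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical example with the study's class balance (93 relevant of 790):
# a run that includes 90 of the 93 relevant references and wrongly includes
# 14 of the 697 irrelevant ones yields sensitivity = 90/93 ≈ 0.97 and
# specificity = 683/697 ≈ 0.98.
```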
Conclusion
State-of-the-art LLMs can effectively screen multiple abstracts per prompt, moving beyond inefficient single-reference processing. However, performance is model-dependent, with trade-offs between sensitivity and specificity. Batch size optimization and strategic model selection are therefore important considerations for successful implementation.