Validation of automated paper screening for esophagectomy systematic review using large language models.

IF 3.5 · CAS Tier 4 (Computer Science) · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
PeerJ Computer Science · Pub Date: 2025-04-30 · eCollection Date: 2025-01-01 · DOI: 10.7717/peerj-cs.2822
Rashi Ramchandani, Eddie Guo, Esra Rakab, Jharna Rathod, Jamie Strain, William Klement, Risa Shorr, Erin Williams, Daniel Jones, Sebastien Gilbert

Abstract

Background: Large language models (LLMs) offer a potential solution to the labor-intensive nature of systematic reviews. This study evaluated the ability of the GPT model to identify articles that discuss perioperative risk factors for esophagectomy complications. To test the performance of the model, we also evaluated GPT-4 on a narrower inclusion criterion, assessing its ability to discriminate relevant articles that solely identified preoperative risk factors for esophagectomy.

Methods: A literature search was run by a trained librarian to identify studies (n = 1,967) discussing risk factors for esophagectomy complications. The articles underwent title and abstract screening by three independent human reviewers and GPT-4. The Python script used for the analysis made Application Programming Interface (API) calls to GPT-4 with the screening criteria expressed in natural language. GPT-4's inclusion and exclusion decisions were compared to those made by the human reviewers.
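The screening loop described above can be sketched as follows. This is a minimal illustration assuming the OpenAI Python SDK (openai >= 1.0); the prompt wording, helper names, and INCLUDE/EXCLUDE parsing rule are hypothetical, as the paper's actual script is not reproduced here.

```python
def build_prompt(criteria: str, title: str, abstract: str) -> str:
    """Embed the natural-language screening criteria and one record."""
    return (
        "You are screening articles for a systematic review.\n"
        f"Criteria:\n{criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Reply with INCLUDE or EXCLUDE, then a one-sentence justification."
    )

def parse_decision(reply: str) -> bool:
    """Map the model's free-text reply to an include/exclude boolean."""
    return reply.strip().upper().startswith("INCLUDE")

def screen_record(criteria: str, title: str, abstract: str) -> bool:
    """One title/abstract screening call; needs OPENAI_API_KEY set."""
    from openai import OpenAI  # deferred so the pure helpers work offline
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": build_prompt(criteria, title, abstract)}],
    )
    return parse_decision(resp.choices[0].message.content)
```

Asking the model for a justification alongside the decision mirrors the study's design, where the model's stated rationale helped screeners resolve discrepancies.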

Results: The agreement between the GPT model and the human decisions was 85.58% for perioperative factors and 78.75% for preoperative factors. The AUC values were 0.87 and 0.75 for the perioperative and preoperative risk factor queries, respectively. In the evaluation of perioperative risk factors, the GPT model demonstrated a high recall for included studies at 89%, a positive predictive value of 74%, and a negative predictive value of 84%, with a low false positive rate of 6% and a macro-F1 score of 0.81. For preoperative risk factors, the model showed a recall of 67% for included studies, a positive predictive value of 65%, and a negative predictive value of 85%, with a false positive rate of 15% and a macro-F1 score of 0.66. The interobserver reliability was substantial, with a kappa score of 0.69 for perioperative factors and 0.61 for preoperative factors. Despite lower accuracy under the more stringent criteria, the GPT model proved valuable in streamlining the systematic review workflow. In a preliminary evaluation, the inclusion and exclusion justifications provided by the GPT model were reported by study screeners to be useful, especially in resolving discrepancies during title and abstract screening.
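All of the summary statistics reported above derive from a 2×2 confusion matrix of model-versus-human decisions. A self-contained sketch of the computations (the counts in the usage example are illustrative, not the study's data):

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Summary statistics from a model-vs-human 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    f1_pos = 2 * tp / (2 * tp + fp + fn)    # F1 for the "include" class
    f1_neg = 2 * tn / (2 * tn + fn + fp)    # F1 for the "exclude" class
    return {
        "agreement": (tp + tn) / n,         # raw percent agreement
        "recall": tp / (tp + fn),           # recall for included studies
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
        "fpr": fp / (fp + tn),              # false positive rate
        "macro_f1": (f1_pos + f1_neg) / 2,  # unweighted mean of class F1s
    }

def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Chance-corrected interobserver agreement (Cohen's kappa)."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                     # observed agreement
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (p_o - p_e) / (1 - p_e)          # kappa in [-1, 1]
```

For example, `screening_metrics(8, 2, 2, 8)` yields recall 0.80, PPV 0.80, and FPR 0.20, while `cohens_kappa(8, 2, 2, 8)` yields 0.60, which falls in the "substantial" agreement band the abstract refers to.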

Conclusion: This study demonstrates a promising use of LLMs to streamline the workflow of systematic reviews. The integration of LLMs into systematic reviews could lead to significant time and cost savings; however, caution must be taken for reviews involving more stringent or narrower inclusion and exclusion criteria. Future research should explore integrating LLMs into other steps of the systematic review, such as full-text screening or data extraction, and compare different LLMs for their effectiveness in various types of systematic reviews.

Source journal
PeerJ Computer Science (General Computer Science)
CiteScore: 6.10
Self-citation rate: 5.30%
Articles per year: 332
Review time: 10 weeks
Journal description: PeerJ Computer Science is an open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.