Assessing the methodologic quality of systematic reviews using generative large language models.

Impact factor 2.0 · CAS Zone 4 (Medicine) · JCR Q3, Urology & Nephrology
Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm
{"title":"Assessing the methodologic quality of systematic reviews using generative large language models.","authors":"Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm","doi":"10.5489/cuaj.9243","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).</p><p><strong>Methods: </strong>A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicates, with differences adjudicated by a third expert. We created a customized GPT \"Urology AMSTAR 2 Quality Assessor\" and graded the 114 SRs in three iterations using a zero-shot method. We performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each of the SRs using chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial was calculated against human results. Internal validity among three trials were computed.</p><p><strong>Results: </strong>GPT had an overall congruence of 75%, with 77% in critical criteria and 73% in non-critical criteria when compared to human results. The average F1 score was 0.66. There was a high internal validity at 85% among three iterations. GPT accurately assigned 89% of studies into the correct overall category. When given specific, step-by-step instructions, congruence of critical criteria improved to 91%, and overall quality assessment accuracy to 93%.</p><p><strong>Conclusions: </strong>GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.</p>","PeriodicalId":50613,"journal":{"name":"Cuaj-Canadian Urological Association Journal","volume":" ","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cuaj-Canadian Urological Association Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.5489/cuaj.9243","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).

Methods: A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each SR in duplicate, with differences adjudicated by a third expert. We created a customized GPT, "Urology AMSTAR 2 Quality Assessor," and graded the 114 SRs in three iterations using a zero-shot method. We performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each SR using a chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial were calculated against the human results. Internal validity across the three trials was computed.
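To make the scoring concrete, below is a minimal Python sketch (not the authors' code) of how one GPT trial could be compared against the human-adjudicated reference. It assumes each AMSTAR 2 criterion has been reduced to a binary met/not-met judgment; the labels shown are hypothetical.

```python
# Minimal sketch of scoring one GPT trial against human-adjudicated labels.
# Assumes each AMSTAR 2 criterion is reduced to a binary judgment
# (1 = criterion met, 0 = not met); the example labels are hypothetical.

def score_trial(gpt, human):
    """Return accuracy, sensitivity, specificity, and F1 for binary labels."""
    tp = sum(g == 1 and h == 1 for g, h in zip(gpt, human))
    tn = sum(g == 0 and h == 0 for g, h in zip(gpt, human))
    fp = sum(g == 1 and h == 0 for g, h in zip(gpt, human))
    fn = sum(g == 0 and h == 1 for g, h in zip(gpt, human))

    accuracy = (tp + tn) / len(human)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return accuracy, sensitivity, specificity, f1

# Hypothetical judgments for one criterion across a handful of SRs.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
gpt_labels   = [1, 0, 0, 1, 0, 1, 1, 1]
print(score_trial(gpt_labels, human_labels))
```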

Results: Compared with the human results, GPT had an overall congruence of 75%: 77% on critical criteria and 73% on non-critical criteria. The average F1 score was 0.66. Internal validity across the three iterations was high at 85%. GPT assigned 89% of studies to the correct overall quality category. When given specific, step-by-step instructions, congruence on critical criteria improved to 91% and overall quality assessment accuracy to 93%.
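The internal validity figure can be read as percent agreement across the three zero-shot runs. The sketch below (an assumption about the computation, not the published analysis) counts the share of judgments on which all three iterations returned the same verdict.

```python
# Sketch of internal validity as unanimous percent agreement across runs
# (an assumption, not the published analysis). Verdicts are hypothetical.

def internal_validity(*iterations):
    """Fraction of items graded identically in every iteration."""
    agree = sum(len(set(votes)) == 1 for votes in zip(*iterations))
    return agree / len(iterations[0])

# Hypothetical per-item verdicts from three GPT runs.
run1 = [1, 1, 0, 1, 0, 1]
run2 = [1, 1, 0, 0, 0, 1]
run3 = [1, 1, 0, 1, 0, 1]
print(f"{internal_validity(run1, run2, run3):.0%}")  # prints 83%
```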

Conclusions: GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.

Source journal

Cuaj-Canadian Urological Association Journal (Medicine - Urology & Nephrology)
CiteScore: 2.80
Self-citation rate: 10.50%
Annual articles: 167
Review time: >12 weeks
Journal description: CUAJ is a peer-reviewed, open-access journal devoted to promoting the highest standard of urological patient care through the publication of timely, relevant, evidence-based research and advocacy information.