Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm
{"title":"使用生成式大型语言模型评估系统综述的方法学质量。","authors":"Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm","doi":"10.5489/cuaj.9243","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).</p><p><strong>Methods: </strong>A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicates, with differences adjudicated by a third expert. We created a customized GPT \"Urology AMSTAR 2 Quality Assessor\" and graded the 114 SRs in three iterations using a zero-shot method. We performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each of the SRs using chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial was calculated against human results. Internal validity among three trials were computed.</p><p><strong>Results: </strong>GPT had an overall congruence of 75%, with 77% in critical criteria and 73% in non-critical criteria when compared to human results. The average F1 score was 0.66. There was a high internal validity at 85% among three iterations. GPT accurately assigned 89% of studies into the correct overall category. 
When given specific, step-by-step instructions, congruence of critical criteria improved to 91%, and overall quality assessment accuracy to 93%.</p><p><strong>Conclusions: </strong>GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.</p>","PeriodicalId":50613,"journal":{"name":"Cuaj-Canadian Urological Association Journal","volume":" ","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessing the methodologic quality of systematic reviews using generative large language models.\",\"authors\":\"Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm\",\"doi\":\"10.5489/cuaj.9243\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).</p><p><strong>Methods: </strong>A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicates, with differences adjudicated by a third expert. We created a customized GPT \\\"Urology AMSTAR 2 Quality Assessor\\\" and graded the 114 SRs in three iterations using a zero-shot method. We performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each of the SRs using chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial was calculated against human results. Internal validity among three trials were computed.</p><p><strong>Results: </strong>GPT had an overall congruence of 75%, with 77% in critical criteria and 73% in non-critical criteria when compared to human results. The average F1 score was 0.66. There was a high internal validity at 85% among three iterations. 
GPT accurately assigned 89% of studies into the correct overall category. When given specific, step-by-step instructions, congruence of critical criteria improved to 91%, and overall quality assessment accuracy to 93%.</p><p><strong>Conclusions: </strong>GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.</p>\",\"PeriodicalId\":50613,\"journal\":{\"name\":\"Cuaj-Canadian Urological Association Journal\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cuaj-Canadian Urological Association Journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.5489/cuaj.9243\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"UROLOGY & NEPHROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cuaj-Canadian Urological Association Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.5489/cuaj.9243","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Assessing the methodologic quality of systematic reviews using generative large language models.
Introduction: We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).
Methods: A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each SR in duplicate, with discrepancies adjudicated by a third expert. We created a customized GPT, "Urology AMSTAR 2 Quality Assessor," and graded the 114 SRs in three iterations using a zero-shot method. We then performed an enhanced trial focusing on the critical criteria, giving GPT detailed, step-by-step instructions for each SR using a chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial were calculated against the human results. Internal validity across the three trials was computed.
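The per-trial metrics described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual analysis code: it assumes each AMSTAR 2 criterion rating has been binarized (1 = "Yes", 0 = "No"), with the adjudicated human ratings serving as the reference standard; all function names are hypothetical.

```python
def confusion_counts(human, gpt):
    """Count true/false positives and negatives, treating the
    adjudicated human ratings as the reference standard."""
    tp = sum(1 for h, g in zip(human, gpt) if h == 1 and g == 1)
    tn = sum(1 for h, g in zip(human, gpt) if h == 0 and g == 0)
    fp = sum(1 for h, g in zip(human, gpt) if h == 0 and g == 1)
    fn = sum(1 for h, g in zip(human, gpt) if h == 1 and g == 0)
    return tp, tn, fp, fn

def trial_metrics(human, gpt):
    """Accuracy, sensitivity, specificity, and F1 for one GPT trial,
    computed against the human reference ratings."""
    tp, tn, fp, fn = confusion_counts(human, gpt)
    accuracy = (tp + tn) / len(human)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}
```

In practice these counts would be accumulated per criterion across all 114 SRs before computing the summary metrics for each trial.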
Results: Compared with the human results, GPT had an overall congruence of 75%: 77% for the critical criteria and 73% for the non-critical criteria. The average F1 score was 0.66. Internal validity across the three iterations was high, at 85%. GPT assigned 89% of studies to the correct overall quality category. When given specific, step-by-step instructions, congruence on the critical criteria improved to 91%, and overall quality assessment accuracy to 93%.
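One plausible reading of the 85% internal validity figure is the fraction of ratings on which all three zero-shot iterations agreed. The abstract does not specify the exact formula, so the sketch below is an assumption about the computation, not the authors' definition:

```python
def internal_agreement(iterations):
    """Fraction of items rated identically across all iterations.

    `iterations` is a list of equal-length rating lists, one per
    GPT iteration (e.g., three lists of per-criterion ratings).
    """
    n_items = len(iterations[0])
    agree = sum(1 for ratings in zip(*iterations)
                if len(set(ratings)) == 1)  # unanimous across iterations
    return agree / n_items
```

A chance-corrected statistic such as Fleiss' kappa would be an alternative way to summarize agreement among the three iterations.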
Conclusions: GPT showed a promising ability to efficiently and accurately assess the quality of SRs in urology.
Journal overview:
CUAJ is a peer-reviewed, open-access journal devoted to promoting the highest standard of urological patient care through the publication of timely, relevant, evidence-based research and advocacy information.