Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm
{"title":"使用生成式大型语言模型评估系统综述的方法学质量。","authors":"Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm","doi":"10.5489/cuaj.9243","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).</p><p><strong>Methods: </strong>A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicates, with differences adjudicated by a third expert. We created a customized GPT \"Urology AMSTAR 2 Quality Assessor\" and graded the 114 SRs in three iterations using a zero-shot method. We performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each of the SRs using chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial was calculated against human results. Internal validity among three trials were computed.</p><p><strong>Results: </strong>GPT had an overall congruence of 75%, with 77% in critical criteria and 73% in non-critical criteria when compared to human results. The average F1 score was 0.66. There was a high internal validity at 85% among three iterations. GPT accurately assigned 89% of studies into the correct overall category. 
When given specific, step-by-step instructions, congruence of critical criteria improved to 91%, and overall quality assessment accuracy to 93%.</p><p><strong>Conclusions: </strong>GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.</p>","PeriodicalId":50613,"journal":{"name":"Cuaj-Canadian Urological Association Journal","volume":" ","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessing the methodologic quality of systematic reviews using generative large language models.\",\"authors\":\"Bowen Yao, Onuralp Ergun, Maylynn Ding, Carly D Miller, Vikram M Narayan, Philipp Dahm\",\"doi\":\"10.5489/cuaj.9243\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).</p><p><strong>Methods: </strong>A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each of the SRs in duplicates, with differences adjudicated by a third expert. We created a customized GPT \\\"Urology AMSTAR 2 Quality Assessor\\\" and graded the 114 SRs in three iterations using a zero-shot method. We performed an enhanced trial focusing on critical criteria by giving GPT detailed, step-by-step instructions for each of the SRs using chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial was calculated against human results. Internal validity among three trials were computed.</p><p><strong>Results: </strong>GPT had an overall congruence of 75%, with 77% in critical criteria and 73% in non-critical criteria when compared to human results. The average F1 score was 0.66. There was a high internal validity at 85% among three iterations. 
GPT accurately assigned 89% of studies into the correct overall category. When given specific, step-by-step instructions, congruence of critical criteria improved to 91%, and overall quality assessment accuracy to 93%.</p><p><strong>Conclusions: </strong>GPT showed promising ability to efficiently and accurately assess the quality of SRs in urology.</p>\",\"PeriodicalId\":50613,\"journal\":{\"name\":\"Cuaj-Canadian Urological Association Journal\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cuaj-Canadian Urological Association Journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.5489/cuaj.9243\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"UROLOGY & NEPHROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cuaj-Canadian Urological Association Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.5489/cuaj.9243","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Assessing the methodologic quality of systematic reviews using generative large language models.
Introduction: We aimed to evaluate whether generative large language models (LLMs) can accurately assess the methodologic quality of systematic reviews (SRs).
Methods: A total of 114 SRs from five leading urology journals were included in the study. Human reviewers graded each SR in duplicate, with discrepancies adjudicated by a third expert. We created a customized GPT, "Urology AMSTAR 2 Quality Assessor," and graded the 114 SRs in three iterations using a zero-shot method. We then performed an enhanced trial focusing on the critical criteria, giving GPT detailed, step-by-step instructions for each SR using a chain-of-thought method. Accuracy, sensitivity, specificity, and F1 score for each GPT trial were calculated against the human results. Internal validity across the three trials was computed.
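The per-trial metrics described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual analysis code: it assumes each AMSTAR 2 criterion rating has been binarized (1 = "Yes", 0 = "No"), with the adjudicated human ratings serving as the reference standard; all function names are hypothetical.

```python
def confusion_counts(human, gpt):
    """Count true/false positives and negatives, treating the
    adjudicated human ratings as the reference standard."""
    tp = sum(1 for h, g in zip(human, gpt) if h == 1 and g == 1)
    tn = sum(1 for h, g in zip(human, gpt) if h == 0 and g == 0)
    fp = sum(1 for h, g in zip(human, gpt) if h == 0 and g == 1)
    fn = sum(1 for h, g in zip(human, gpt) if h == 1 and g == 0)
    return tp, tn, fp, fn

def trial_metrics(human, gpt):
    """Accuracy, sensitivity, specificity, and F1 for one GPT trial,
    computed against the human reference ratings."""
    tp, tn, fp, fn = confusion_counts(human, gpt)
    accuracy = (tp + tn) / len(human)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}
```

In practice these counts would be accumulated per criterion across all 114 SRs before computing the summary metrics for each trial.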
Results: Compared with the human results, GPT had an overall congruence of 75%: 77% for the critical criteria and 73% for the non-critical criteria. The average F1 score was 0.66. Internal validity across the three iterations was high, at 85%. GPT assigned 89% of studies to the correct overall quality category. When given specific, step-by-step instructions, congruence on the critical criteria improved to 91%, and overall quality assessment accuracy to 93%.
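One plausible reading of the 85% internal validity figure is the fraction of ratings on which all three zero-shot iterations agreed. The abstract does not specify the exact formula, so the sketch below is an assumption about the computation, not the authors' definition:

```python
def internal_agreement(iterations):
    """Fraction of items rated identically across all iterations.

    `iterations` is a list of equal-length rating lists, one per
    GPT iteration (e.g., three lists of per-criterion ratings).
    """
    n_items = len(iterations[0])
    agree = sum(1 for ratings in zip(*iterations)
                if len(set(ratings)) == 1)  # unanimous across iterations
    return agree / n_items
```

A chance-corrected statistic such as Fleiss' kappa would be an alternative way to summarize agreement among the three iterations.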
Conclusions: GPT showed a promising ability to efficiently and accurately assess the quality of SRs in urology.
Journal overview:
CUAJ is a peer-reviewed, open-access journal devoted to promoting the highest standard of urological patient care through the publication of timely, relevant, evidence-based research and advocacy information.