Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.

IF 4.7 2区 医学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Diego A Forero, Sandra E Abreu, Blanca E Tovar, Marilyn H Oermann
{"title":"Automated analyses of risk of bias and critical appraisal of systematic reviews (ROBIS and AMSTAR 2): a comparison of the performance of 4 large language models.","authors":"Diego A Forero, Sandra E Abreu, Blanca E Tovar, Marilyn H Oermann","doi":"10.1093/jamia/ocaf117","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>To explore the performance of 4 large language model (LLM) chatbots for the analysis of 2 of the most commonly used tools for the advanced analysis of systematic reviews (SRs) and meta-analyses.</p><p><strong>Materials and methods: </strong>We explored the performance of 4 LLM chatbots (ChatGPT, Gemini, DeepSeek, and QWEN) for the analysis of ROBIS and AMSTAR 2 tools (sample sizes: 20 SRs), in comparison with assessments by human experts.</p><p><strong>Results: </strong>Gemini showed the best agreement with human experts for both ROBIS and AMSTAR 2 (accuracy: 58% and 70%). The second best LLM chatbots were ChatGPT and QWEN, for ROBIS and AMSTAR 2, respectively.</p><p><strong>Discussion: </strong>Some LLM chatbots underestimated the risk of bias or overestimated the confidence of the results in published SRs, which is compatible with recent articles for other tools.</p><p><strong>Conclusion: </strong>This is one of the first studies comparing the performance of several LLM chatbots for the automated analyses of ROBIS and AMSTAR 2.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocaf117","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Objectives: To explore the performance of 4 large language model (LLM) chatbots for the analysis of 2 of the most commonly used tools for the advanced analysis of systematic reviews (SRs) and meta-analyses.

Materials and methods: We explored the performance of 4 LLM chatbots (ChatGPT, Gemini, DeepSeek, and QWEN) for the analysis of ROBIS and AMSTAR 2 tools (sample sizes: 20 SRs), in comparison with assessments by human experts.

Results: Gemini showed the best agreement with human experts for both ROBIS and AMSTAR 2 (accuracy: 58% and 70%). The second best LLM chatbots were ChatGPT and QWEN, for ROBIS and AMSTAR 2, respectively.

Discussion: Some LLM chatbots underestimated the risk of bias or overestimated the confidence of the results in published SRs, which is compatible with recent articles for other tools.

Conclusion: This is one of the first studies comparing the performance of several LLM chatbots for the automated analyses of ROBIS and AMSTAR 2.

系统评价的偏差风险和批判性评估的自动分析(ROBIS和AMSTAR 2): 4个大型语言模型的性能比较。
目的:探讨4个大型语言模型(LLM)聊天机器人的性能,用于分析两种最常用的工具,用于系统评价(SRs)和荟萃分析的高级分析。材料和方法:我们探索了4个LLM聊天机器人(ChatGPT、Gemini、DeepSeek和QWEN)的性能,用于分析ROBIS和AMSTAR 2工具(样本量:20个SRs),并与人类专家的评估进行了比较。结果:Gemini与人类专家在ROBIS和AMSTAR 2上的一致性最好(准确率分别为58%和70%)。第二好的LLM聊天机器人是ChatGPT和QWEN,分别用于ROBIS和AMSTAR 2。讨论:一些法学硕士聊天机器人低估了偏见的风险,或者高估了已发表的SRs结果的置信度,这与其他工具的最新文章是兼容的。结论:这是比较几种LLM聊天机器人在ROBIS和AMSTAR 2自动分析中的性能的首批研究之一。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Journal of the American Medical Informatics Association
Journal of the American Medical Informatics Association 医学-计算机:跨学科应用
CiteScore
14.50
自引率
7.80%
发文量
230
审稿时长
3-8 weeks
期刊介绍: JAMIA is AMIA''s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA''s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信