Use of artificial intelligence to support the assessment of the methodological quality of systematic reviews

IF 5.2 · CAS Tier 2 (Medicine) · JCR Q1 (HEALTH CARE SCIENCES & SERVICES)
Manuel Marques-Cruz, Filipe Pinto, Rafael José Vieira, Antonio Bognanni, Paula Perestrelo, Sara Gil-Mata, Vítor Henrique Duarte, José Pedro Barbosa, António Cardoso-Fernandes, Daniel Martinho-Dias, Francisco Franco-Pego, Federico Germini, Chiara Arienti, Alexandro W.L. Chu, Pau Riera-Serra, Paweł Jemioło, Pedro Pereira Rodrigues, João A. Fonseca, Luís Filipe Azevedo, Holger J. Schünemann, Bernardo Sousa-Pinto
{"title":"使用人工智能来支持系统评价的方法学质量评估。","authors":"Manuel Marques-Cruz ,&nbsp;Filipe Pinto ,&nbsp;Rafael José Vieira ,&nbsp;Antonio Bognanni ,&nbsp;Paula Perestrelo ,&nbsp;Sara Gil-Mata ,&nbsp;Vítor Henrique Duarte ,&nbsp;José Pedro Barbosa ,&nbsp;António Cardoso-Fernandes ,&nbsp;Daniel Martinho-Dias ,&nbsp;Francisco Franco-Pego ,&nbsp;Federico Germini ,&nbsp;Chiara Arienti ,&nbsp;Alexandro W.L. Chu ,&nbsp;Pau Riera-Serra ,&nbsp;Paweł Jemioło ,&nbsp;Pedro Pereira Rodrigues ,&nbsp;João A. Fonseca ,&nbsp;Luís Filipe Azevedo ,&nbsp;Holger J. Schünemann ,&nbsp;Bernardo Sousa-Pinto","doi":"10.1016/j.jclinepi.2025.111944","DOIUrl":null,"url":null,"abstract":"<div><h3>Objectives</h3><div>Published systematic reviews display a heterogeneous methodological quality, which can impact decision-making. Large language models (LLMs) can support and make the assessment of the methodological quality of systematic reviews more efficient, aiding in the incorporation of their evidence in guideline recommendations. We aimed to develop an LLM-based tool for supporting the assessment of the methodological quality of systematic reviews.</div></div><div><h3>Methods</h3><div>We assessed the performance of 8 LLMs in evaluating the methodological quality of systematic reviews. In particular, we provided 100 systematic reviews for eight LLMs (five base models and three fine-tuned models) to evaluate their methodological quality based on a 27-item validated tool (Reported Methodological Quality (ReMarQ)). The fine-tuned models had been trained with a different sample of 300 manually assessed systematic reviews. We compared the answers provided by LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient and F1-score for this comparison.</div></div><div><h3>Results</h3><div>The best performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy &gt;80% and a kappa coefficient &gt;0.60 for all individual items. When we made this LLM assess 60 times the same set of systematic reviews, answers to 18 of 27 items were always consistent (ie, were always the same) and only 11% of assessed systematic reviews showed inconsistency.</div></div><div><h3>Conclusion</h3><div>Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"187 ","pages":"Article 111944"},"PeriodicalIF":5.2000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Use of artificial intelligence to support the assessment of the methodological quality of systematic reviews\",\"authors\":\"Manuel Marques-Cruz ,&nbsp;Filipe Pinto ,&nbsp;Rafael José Vieira ,&nbsp;Antonio Bognanni ,&nbsp;Paula Perestrelo ,&nbsp;Sara Gil-Mata ,&nbsp;Vítor Henrique Duarte ,&nbsp;José Pedro Barbosa ,&nbsp;António Cardoso-Fernandes ,&nbsp;Daniel Martinho-Dias ,&nbsp;Francisco Franco-Pego ,&nbsp;Federico Germini ,&nbsp;Chiara Arienti ,&nbsp;Alexandro W.L. Chu ,&nbsp;Pau Riera-Serra ,&nbsp;Paweł Jemioło ,&nbsp;Pedro Pereira Rodrigues ,&nbsp;João A. Fonseca ,&nbsp;Luís Filipe Azevedo ,&nbsp;Holger J. 
Schünemann ,&nbsp;Bernardo Sousa-Pinto\",\"doi\":\"10.1016/j.jclinepi.2025.111944\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objectives</h3><div>Published systematic reviews display a heterogeneous methodological quality, which can impact decision-making. Large language models (LLMs) can support and make the assessment of the methodological quality of systematic reviews more efficient, aiding in the incorporation of their evidence in guideline recommendations. We aimed to develop an LLM-based tool for supporting the assessment of the methodological quality of systematic reviews.</div></div><div><h3>Methods</h3><div>We assessed the performance of 8 LLMs in evaluating the methodological quality of systematic reviews. In particular, we provided 100 systematic reviews for eight LLMs (five base models and three fine-tuned models) to evaluate their methodological quality based on a 27-item validated tool (Reported Methodological Quality (ReMarQ)). The fine-tuned models had been trained with a different sample of 300 manually assessed systematic reviews. We compared the answers provided by LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient and F1-score for this comparison.</div></div><div><h3>Results</h3><div>The best performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy &gt;80% and a kappa coefficient &gt;0.60 for all individual items. When we made this LLM assess 60 times the same set of systematic reviews, answers to 18 of 27 items were always consistent (ie, were always the same) and only 11% of assessed systematic reviews showed inconsistency.</div></div><div><h3>Conclusion</h3><div>Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.</div></div>\",\"PeriodicalId\":51079,\"journal\":{\"name\":\"Journal of Clinical Epidemiology\",\"volume\":\"187 \",\"pages\":\"Article 111944\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2025-08-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Clinical Epidemiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S089543562500277X\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Clinical Epidemiology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S089543562500277X","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Use of artificial intelligence to support the assessment of the methodological quality of systematic reviews

Objectives

Published systematic reviews display heterogeneous methodological quality, which can impact decision-making. Large language models (LLMs) can support the assessment of the methodological quality of systematic reviews and make it more efficient, aiding the incorporation of their evidence into guideline recommendations. We aimed to develop an LLM-based tool to support the assessment of the methodological quality of systematic reviews.

Methods

We assessed the performance of eight LLMs in evaluating the methodological quality of systematic reviews. In particular, we provided 100 systematic reviews to eight LLMs (five base models and three fine-tuned models), which evaluated their methodological quality based on a 27-item validated tool, the Reported Methodological Quality (ReMarQ) tool. The fine-tuned models had been trained on a different sample of 300 manually assessed systematic reviews. We compared the answers provided by the LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient, and F1-score for this comparison.
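
As a rough illustration of the comparison described above (a minimal sketch, not the authors' code), the snippet below computes accuracy, Cohen's kappa, and F1-score for one dichotomous ReMarQ-style item using scikit-learn; the answer vectors are invented for illustration.

```python
# Minimal sketch: agreement between LLM answers and human reviewer answers
# for a single dichotomous item (1 = element reported, 0 = not reported).
# The data below are illustrative assumptions, not study data.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

human_answers = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # independent human assessment
llm_answers   = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]  # LLM assessment of the same reviews

accuracy = accuracy_score(human_answers, llm_answers)  # proportion of matching answers
kappa = cohen_kappa_score(human_answers, llm_answers)  # chance-corrected agreement
f1 = f1_score(human_answers, llm_answers)              # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}, F1={f1:.2f}")
```

In the study these metrics were computed per item and then summarized across the 27 items; the sketch only shows the per-item calculation.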

Results

The best-performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When we had this LLM assess the same set of systematic reviews 60 times, answers to 18 of the 27 items were always consistent (i.e., always the same), and only 11% of the assessed systematic reviews showed any inconsistency.
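
To make the consistency analysis concrete, the following is a minimal sketch (with simulated data and invented variable names, not the study's analysis) of how one might flag items that always receive the same answer across repeated runs, and reviews that show any inconsistency.

```python
# Minimal sketch: repeat-consistency of dichotomous answers across runs.
# runs[r, i, k] = answer to item i for review r on run k (simulated data).
import numpy as np

n_reviews, n_items, n_runs = 5, 27, 60
rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=(n_reviews, n_items))      # a fixed "true" answer set
runs = np.repeat(base[:, :, None], n_runs, axis=2)        # identical answers on every run
runs[0, 3, 10] ^= 1                                        # inject one inconsistent answer

# An item is "always consistent" if every review gets identical answers across all runs.
item_consistent = (runs == runs[:, :, :1]).all(axis=2).all(axis=0)
print(f"{item_consistent.sum()} of {n_items} items were always consistent")

# A review is "inconsistent" if any of its items changed across runs.
review_inconsistent = ~(runs == runs[:, :, :1]).all(axis=2).all(axis=1)
print(f"{review_inconsistent.mean():.0%} of reviews showed some inconsistency")
```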

Conclusion

Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.
Source journal: Journal of Clinical Epidemiology (Medicine – Public, Environmental & Occupational Health)
CiteScore: 12.00
Self-citation rate: 6.90%
Articles per year: 320
Time to review: 44 days
Journal description: The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.