Manuel Marques-Cruz, Filipe Pinto, Rafael José Vieira, Antonio Bognanni, Paula Perestrelo, Sara Gil-Mata, Vítor Henrique Duarte, José Pedro Barbosa, António Cardoso-Fernandes, Daniel Martinho-Dias, Francisco Franco-Pego, Federico Germini, Chiara Arienti, Alexandro W.L. Chu, Pau Riera-Serra, Paweł Jemioło, Pedro Pereira Rodrigues, João A. Fonseca, Luís Filipe Azevedo, Holger J. Schünemann, Bernardo Sousa-Pinto
{"title":"使用人工智能来支持系统评价的方法学质量评估。","authors":"Manuel Marques-Cruz , Filipe Pinto , Rafael José Vieira , Antonio Bognanni , Paula Perestrelo , Sara Gil-Mata , Vítor Henrique Duarte , José Pedro Barbosa , António Cardoso-Fernandes , Daniel Martinho-Dias , Francisco Franco-Pego , Federico Germini , Chiara Arienti , Alexandro W.L. Chu , Pau Riera-Serra , Paweł Jemioło , Pedro Pereira Rodrigues , João A. Fonseca , Luís Filipe Azevedo , Holger J. Schünemann , Bernardo Sousa-Pinto","doi":"10.1016/j.jclinepi.2025.111944","DOIUrl":null,"url":null,"abstract":"<div><h3>Objectives</h3><div>Published systematic reviews display a heterogeneous methodological quality, which can impact decision-making. Large language models (LLMs) can support and make the assessment of the methodological quality of systematic reviews more efficient, aiding in the incorporation of their evidence in guideline recommendations. We aimed to develop an LLM-based tool for supporting the assessment of the methodological quality of systematic reviews.</div></div><div><h3>Methods</h3><div>We assessed the performance of 8 LLMs in evaluating the methodological quality of systematic reviews. In particular, we provided 100 systematic reviews for eight LLMs (five base models and three fine-tuned models) to evaluate their methodological quality based on a 27-item validated tool (Reported Methodological Quality (ReMarQ)). The fine-tuned models had been trained with a different sample of 300 manually assessed systematic reviews. We compared the answers provided by LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient and F1-score for this comparison.</div></div><div><h3>Results</h3><div>The best performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When we made this LLM assess 60 times the same set of systematic reviews, answers to 18 of 27 items were always consistent (ie, were always the same) and only 11% of assessed systematic reviews showed inconsistency.</div></div><div><h3>Conclusion</h3><div>Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"187 ","pages":"Article 111944"},"PeriodicalIF":5.2000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Use of artificial intelligence to support the assessment of the methodological quality of systematic reviews\",\"authors\":\"Manuel Marques-Cruz , Filipe Pinto , Rafael José Vieira , Antonio Bognanni , Paula Perestrelo , Sara Gil-Mata , Vítor Henrique Duarte , José Pedro Barbosa , António Cardoso-Fernandes , Daniel Martinho-Dias , Francisco Franco-Pego , Federico Germini , Chiara Arienti , Alexandro W.L. Chu , Pau Riera-Serra , Paweł Jemioło , Pedro Pereira Rodrigues , João A. Fonseca , Luís Filipe Azevedo , Holger J. Schünemann , Bernardo Sousa-Pinto\",\"doi\":\"10.1016/j.jclinepi.2025.111944\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objectives</h3><div>Published systematic reviews display a heterogeneous methodological quality, which can impact decision-making. 
Large language models (LLMs) can support and make the assessment of the methodological quality of systematic reviews more efficient, aiding in the incorporation of their evidence in guideline recommendations. We aimed to develop an LLM-based tool for supporting the assessment of the methodological quality of systematic reviews.</div></div><div><h3>Methods</h3><div>We assessed the performance of 8 LLMs in evaluating the methodological quality of systematic reviews. In particular, we provided 100 systematic reviews for eight LLMs (five base models and three fine-tuned models) to evaluate their methodological quality based on a 27-item validated tool (Reported Methodological Quality (ReMarQ)). The fine-tuned models had been trained with a different sample of 300 manually assessed systematic reviews. We compared the answers provided by LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient and F1-score for this comparison.</div></div><div><h3>Results</h3><div>The best performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When we made this LLM assess 60 times the same set of systematic reviews, answers to 18 of 27 items were always consistent (ie, were always the same) and only 11% of assessed systematic reviews showed inconsistency.</div></div><div><h3>Conclusion</h3><div>Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.</div></div>\",\"PeriodicalId\":51079,\"journal\":{\"name\":\"Journal of Clinical Epidemiology\",\"volume\":\"187 \",\"pages\":\"Article 111944\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2025-08-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Clinical Epidemiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S089543562500277X\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Clinical Epidemiology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S089543562500277X","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Use of artificial intelligence to support the assessment of the methodological quality of systematic reviews
Objectives
Published systematic reviews display heterogeneous methodological quality, which can impact decision-making. Large language models (LLMs) can support the assessment of the methodological quality of systematic reviews and make it more efficient, aiding the incorporation of review evidence into guideline recommendations. We aimed to develop an LLM-based tool to support the assessment of the methodological quality of systematic reviews.
Methods
We assessed the performance of eight LLMs in evaluating the methodological quality of systematic reviews. Specifically, we provided 100 systematic reviews to eight LLMs (five base models and three fine-tuned models), which evaluated their methodological quality using the 27-item validated Reported Methodological Quality (ReMarQ) tool. The fine-tuned models had been trained on a separate sample of 300 manually assessed systematic reviews. We compared the answers provided by the LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient, and F1-score for this comparison.
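As an illustration of the comparison described above (a minimal sketch, not the authors' pipeline), the following Python code computes per-item accuracy, Cohen's kappa, and F1-score between LLM answers and human reviewers' answers on dichotomous items, then averages them across items; the array shapes and function names are assumptions for illustration.

```python
# Hypothetical sketch of the per-item agreement metrics (accuracy, Cohen's kappa,
# F1-score) used to compare LLM answers with human reviewers' answers on
# dichotomous ReMarQ items. Array shapes and names are illustrative assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def per_item_agreement(llm_answers: np.ndarray, human_answers: np.ndarray):
    """Both arrays have shape (n_reviews, n_items) with 0/1 entries.

    Returns mean accuracy, kappa coefficient, and F1-score across items.
    """
    n_items = llm_answers.shape[1]
    acc, kappa, f1 = [], [], []
    for item in range(n_items):
        y_true = human_answers[:, item]   # human reviewers as reference
        y_pred = llm_answers[:, item]     # LLM answers for the same item
        acc.append(accuracy_score(y_true, y_pred))
        kappa.append(cohen_kappa_score(y_true, y_pred))
        f1.append(f1_score(y_true, y_pred, zero_division=0))
    return float(np.mean(acc)), float(np.mean(kappa)), float(np.mean(f1))
```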
Results
The best-performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When we had this LLM assess the same set of systematic reviews 60 times, answers to 18 of the 27 items were always consistent (ie, were always the same), and only 11% of the assessed systematic reviews showed any inconsistency.
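The repeatability check reported above can be summarized as in the following sketch (an assumption about the computation, not the authors' code): an item counts as always consistent when all 60 runs return the same answer for every review, and a review counts as inconsistent when any of its item answers varies across runs.

```python
# Hypothetical sketch of the consistency summary over repeated assessments.
# repeated_answers has shape (n_runs, n_reviews, n_items) with 0/1 entries.
import numpy as np

def consistency_summary(repeated_answers: np.ndarray):
    # True where a given (review, item) cell received the same answer in every run
    stable = (repeated_answers == repeated_answers[0]).all(axis=0)   # (n_reviews, n_items)
    items_always_consistent = int(stable.all(axis=0).sum())          # items identical across all runs
    reviews_inconsistent_share = float((~stable.all(axis=1)).mean()) # share of reviews with any varying item
    return items_always_consistent, reviews_inconsistent_share
```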
Conclusion
Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.
About the Journal
The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.