Evaluating the Performance of ChatGPT-4o in Risk of Bias Assessments

Journal of Evidence-Based Medicine 17(4): 700-702 (December 15, 2024). DOI: 10.1111/jebm.12662. IF 3.6, Q1, Medicine, General & Internal.
Ilari Kuitunen, Ville T. Ponkilainen, Rasmus Liukkonen, Lauri Nyrhi, Oskari Pakarinen, Matias Vaajala, Mikko M. Uimonen
{"title":"评估 ChatGPT-4o 在偏差风险评估中的性能。","authors":"Ilari Kuitunen,&nbsp;Ville T. Ponkilainen,&nbsp;Rasmus Liukkonen,&nbsp;Lauri Nyrhi,&nbsp;Oskari Pakarinen,&nbsp;Matias Vaajala,&nbsp;Mikko M. Uimonen","doi":"10.1111/jebm.12662","DOIUrl":null,"url":null,"abstract":"<p>Systematic reviews and meta-analyses are a key part of evidence synthesis and are considered to provide the best possible information on intervention effectiveness [<span>1</span>]. A key part of the evidence synthesis is the critical appraisal of the included studies [<span>2</span>]. The risk of bias is typically assessed by using Cochrane's risk of bias original tool or the revised risk of bias 2.0 tool, both of which are outcome specific tools for randomized controlled trials (RCTs) [<span>3, 4</span>]. Risk of bias assessments are time-consuming in evidence synthesis projects [<span>5</span>]. Additionally, they have been shown to be susceptible to biases, even in top-tier medical journals and Cochrane reviews [<span>6-8</span>]. The interrater agreement has also shown to be varying between reviewers [<span>9</span>]. Therefore, there is a clear need for improvements in both the quality and efficiency of these evaluations.</p><p>The rise of large language models, such as OpenAI's ChatGPT, has led to an increase in the use of these in research. While challenges such as authorship disputes and data fabrication have arisen, these tools show great promise when used appropriately [<span>10</span>]. Two previous studies have evaluated the performance of ChatGPT in risk of bias assessments [<span>11, 12</span>]. One focused on ROBINS-I tool and found rather low agreement in it [<span>11</span>]. Another small study focused on risk of bias (RoB) 2.0 tool, and concluded that currently ChatGPT should not be used, but further studies would be needed [<span>12</span>]. The aim of our current study was to evaluate the performance of the most recent version of OpenAI's large language model ChatGPT-4o in the risk of bias assessment.</p><p>We conducted a systematic assessment of the performance of ChatGPT-4o in Cochranes RoB 2.0 tool analyses. First, we searched PubMed on July 31, 2024 for the most recent 50 meta-analyses published in top-level medical journals (<i>Lancet</i>, <i>JAMA</i> or <i>BMJ</i>). The results were uploaded to Covidence software for a screening process. Then, two authors (IK and OP) screened the reviews and included meta-analyses of interventions, which included only RCTs, and had used Cochrane RoB 2.0 tool as their risk of bias assessment tool. A total of six reviews were included (Figure S1). Then a third author (MV) extracted a total of 100 risk of bias assessments from these included reviews. A fourth author (LN) uploaded these 100 studies in pdf format to ChatGPT-4o with a standardized short prompt which was written to the text field. The prompt was: “Perform a risk of bias analysis according to the Cochrane group RoB2 guidelines for the following article and perform the assessment for the main outcome of the trial. Report results only as high, some concerns, low, no information for domains 1–5 and an overall assessment.” The complete list of the included RCTs and extracted risk of bias assessments and ChatGPT-4o assessments is provided in the Supplementary Material.</p><p>Finally, a fifth author (MU) performed the interrater agreement analyses where we compared ChatGPT-4o performance to the assessments extracted from the published reviews. 
We calculated weighted Fleiss’ kappa estimates with 95% confidence intervals (CIs). Statistical analyses were made in R version 4.2.1. A protocol was pre-registered to Open Science Framework (10.17605/OSF.IO/J67W4).</p><p>A total of 100 RCTs were included for the analysis. The weighted kappa for the overall risk of bias analysis for the primary outcome was 0.24 (95% CI 0.10 to 0.37). The domain-specific agreement was highest in the domain bias arising from the randomization process (kappa = 0.31, 95% CI 0.10 to 0.50), and lowest in the domain bias in selection of the reported results (kappa = –0.11, 95% CI –0.16 to –0.04). In bias due missing outcome data the kappa was 0.12 (95% CI 0.01 to 0.22), in bias due deviations from the interventions 0.06 (95% CI –0.12 to 0.23), and in bias due to measurement of the outcome –0.03 (–0.06 to 0.00). When comparing the given ratings, ChatGPT-4o labeled only three studies as high risk of bias for a single domain but classified none of the studies to have an overall high risk of bias (Figure 1).</p><p>Our current study found that ChatGPT-4o had slight agreement rate in the overall assessment, and slight agreement for bias due randomization domain, whereas the other domains varied between no agreement to poor agreement. Our results were in line with the previously published smaller report of ChatGPT-4o performance in RoB 2.0 assessment [<span>12</span>]. Compared to that study, we did not perform the risk of bias assessment ourselves. Furthermore, we had eight times more studies included, which increased the validity of our results. Another notable difference was that we used a standardized prompt systematically for all evaluations, which improved the standardization of the assessments.</p><p>The primary limitation of our study was the use of relatively simple prompts, as the performance of large language models was influenced by prompt quality. Nevertheless, this reflected a realistic scenario in which users of ChatGPT-4o might utilize it for critical appraisal without extensive knowledge of prompt engineering or awareness of potential weaknesses. Another limitation was that while we initially intended to compare our findings with assessments from Cochrane reviews. However, most Cochrane reviews still employed the original RoB tool [<span>13</span>], whereas the RoB 2.0 tool was more commonly used in high-impact medical journals, and thus we changed the original plan and extracted the assessments from top journals [<span>5</span>]. Furthermore, a limitation was that the interrater agreements were not reported in the included reviews, and thus comparisons directly between the agreements could not be made.</p><p>Future studies should focus on whether the performance of ChatGPT-4o could be enhanced by better prompting or by providing guidance by examples of risk of bias scenarios. Another focus could be how alternative large language models perform compared to human assessment, as well as to compare how different large language models perform against each other. Furthermore, an interesting idea to test could be whether a combination of humans and different large language models aiming to reach a majority voting consensus on the risk of bias assessments would improve the quality and the agreement.</p><p>We found that the performance of ChatGPT-4o was poor in the risk of bias analyses, and mainly, because it was too positive with the assessments. 
The findings of this study highlighted that ChatGPT-4o was not suitable for risk of bias assessments with simple prompts.</p><p>The authors declare no conflicts of interest.</p>","PeriodicalId":16090,"journal":{"name":"Journal of Evidence‐Based Medicine","volume":"17 4","pages":"700-702"},"PeriodicalIF":3.6000,"publicationDate":"2024-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684499/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Performance of ChatGPT-4o in Risk of Bias Assessments\",\"authors\":\"Ilari Kuitunen,&nbsp;Ville T. Ponkilainen,&nbsp;Rasmus Liukkonen,&nbsp;Lauri Nyrhi,&nbsp;Oskari Pakarinen,&nbsp;Matias Vaajala,&nbsp;Mikko M. Uimonen\",\"doi\":\"10.1111/jebm.12662\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Systematic reviews and meta-analyses are a key part of evidence synthesis and are considered to provide the best possible information on intervention effectiveness [<span>1</span>]. A key part of the evidence synthesis is the critical appraisal of the included studies [<span>2</span>]. The risk of bias is typically assessed by using Cochrane's risk of bias original tool or the revised risk of bias 2.0 tool, both of which are outcome specific tools for randomized controlled trials (RCTs) [<span>3, 4</span>]. Risk of bias assessments are time-consuming in evidence synthesis projects [<span>5</span>]. Additionally, they have been shown to be susceptible to biases, even in top-tier medical journals and Cochrane reviews [<span>6-8</span>]. The interrater agreement has also shown to be varying between reviewers [<span>9</span>]. Therefore, there is a clear need for improvements in both the quality and efficiency of these evaluations.</p><p>The rise of large language models, such as OpenAI's ChatGPT, has led to an increase in the use of these in research. While challenges such as authorship disputes and data fabrication have arisen, these tools show great promise when used appropriately [<span>10</span>]. Two previous studies have evaluated the performance of ChatGPT in risk of bias assessments [<span>11, 12</span>]. One focused on ROBINS-I tool and found rather low agreement in it [<span>11</span>]. Another small study focused on risk of bias (RoB) 2.0 tool, and concluded that currently ChatGPT should not be used, but further studies would be needed [<span>12</span>]. The aim of our current study was to evaluate the performance of the most recent version of OpenAI's large language model ChatGPT-4o in the risk of bias assessment.</p><p>We conducted a systematic assessment of the performance of ChatGPT-4o in Cochranes RoB 2.0 tool analyses. First, we searched PubMed on July 31, 2024 for the most recent 50 meta-analyses published in top-level medical journals (<i>Lancet</i>, <i>JAMA</i> or <i>BMJ</i>). The results were uploaded to Covidence software for a screening process. Then, two authors (IK and OP) screened the reviews and included meta-analyses of interventions, which included only RCTs, and had used Cochrane RoB 2.0 tool as their risk of bias assessment tool. A total of six reviews were included (Figure S1). Then a third author (MV) extracted a total of 100 risk of bias assessments from these included reviews. A fourth author (LN) uploaded these 100 studies in pdf format to ChatGPT-4o with a standardized short prompt which was written to the text field. 
The prompt was: “Perform a risk of bias analysis according to the Cochrane group RoB2 guidelines for the following article and perform the assessment for the main outcome of the trial. Report results only as high, some concerns, low, no information for domains 1–5 and an overall assessment.” The complete list of the included RCTs and extracted risk of bias assessments and ChatGPT-4o assessments is provided in the Supplementary Material.</p><p>Finally, a fifth author (MU) performed the interrater agreement analyses where we compared ChatGPT-4o performance to the assessments extracted from the published reviews. We calculated weighted Fleiss’ kappa estimates with 95% confidence intervals (CIs). Statistical analyses were made in R version 4.2.1. A protocol was pre-registered to Open Science Framework (10.17605/OSF.IO/J67W4).</p><p>A total of 100 RCTs were included for the analysis. The weighted kappa for the overall risk of bias analysis for the primary outcome was 0.24 (95% CI 0.10 to 0.37). The domain-specific agreement was highest in the domain bias arising from the randomization process (kappa = 0.31, 95% CI 0.10 to 0.50), and lowest in the domain bias in selection of the reported results (kappa = –0.11, 95% CI –0.16 to –0.04). In bias due missing outcome data the kappa was 0.12 (95% CI 0.01 to 0.22), in bias due deviations from the interventions 0.06 (95% CI –0.12 to 0.23), and in bias due to measurement of the outcome –0.03 (–0.06 to 0.00). When comparing the given ratings, ChatGPT-4o labeled only three studies as high risk of bias for a single domain but classified none of the studies to have an overall high risk of bias (Figure 1).</p><p>Our current study found that ChatGPT-4o had slight agreement rate in the overall assessment, and slight agreement for bias due randomization domain, whereas the other domains varied between no agreement to poor agreement. Our results were in line with the previously published smaller report of ChatGPT-4o performance in RoB 2.0 assessment [<span>12</span>]. Compared to that study, we did not perform the risk of bias assessment ourselves. Furthermore, we had eight times more studies included, which increased the validity of our results. Another notable difference was that we used a standardized prompt systematically for all evaluations, which improved the standardization of the assessments.</p><p>The primary limitation of our study was the use of relatively simple prompts, as the performance of large language models was influenced by prompt quality. Nevertheless, this reflected a realistic scenario in which users of ChatGPT-4o might utilize it for critical appraisal without extensive knowledge of prompt engineering or awareness of potential weaknesses. Another limitation was that while we initially intended to compare our findings with assessments from Cochrane reviews. However, most Cochrane reviews still employed the original RoB tool [<span>13</span>], whereas the RoB 2.0 tool was more commonly used in high-impact medical journals, and thus we changed the original plan and extracted the assessments from top journals [<span>5</span>]. Furthermore, a limitation was that the interrater agreements were not reported in the included reviews, and thus comparisons directly between the agreements could not be made.</p><p>Future studies should focus on whether the performance of ChatGPT-4o could be enhanced by better prompting or by providing guidance by examples of risk of bias scenarios. 
Another focus could be how alternative large language models perform compared to human assessment, as well as to compare how different large language models perform against each other. Furthermore, an interesting idea to test could be whether a combination of humans and different large language models aiming to reach a majority voting consensus on the risk of bias assessments would improve the quality and the agreement.</p><p>We found that the performance of ChatGPT-4o was poor in the risk of bias analyses, and mainly, because it was too positive with the assessments. The findings of this study highlighted that ChatGPT-4o was not suitable for risk of bias assessments with simple prompts.</p><p>The authors declare no conflicts of interest.</p>\",\"PeriodicalId\":16090,\"journal\":{\"name\":\"Journal of Evidence‐Based Medicine\",\"volume\":\"17 4\",\"pages\":\"700-702\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2024-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684499/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Evidence‐Based Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/jebm.12662\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MEDICINE, GENERAL & INTERNAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Evidence‐Based Medicine","FirstCategoryId":"3","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jebm.12662","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0

Abstract

Systematic reviews and meta-analyses are a key part of evidence synthesis and are considered to provide the best possible information on intervention effectiveness [1]. A key part of evidence synthesis is the critical appraisal of the included studies [2]. The risk of bias is typically assessed using Cochrane's original risk of bias tool or the revised Risk of Bias 2.0 tool, both of which are outcome-specific tools for randomized controlled trials (RCTs) [3, 4]. Risk of bias assessments are time-consuming in evidence synthesis projects [5]. Additionally, they have been shown to be susceptible to biases, even in top-tier medical journals and Cochrane reviews [6-8]. Interrater agreement has also been shown to vary between reviewers [9]. Therefore, there is a clear need for improvements in both the quality and efficiency of these evaluations.

The rise of large language models, such as OpenAI's ChatGPT, has led to increasing use of these models in research. While challenges such as authorship disputes and data fabrication have arisen, these tools show great promise when used appropriately [10]. Two previous studies have evaluated the performance of ChatGPT in risk of bias assessments [11, 12]. One focused on the ROBINS-I tool and found rather low agreement [11]. Another small study focused on the Risk of Bias (RoB) 2.0 tool and concluded that ChatGPT should not currently be used, though further studies were needed [12]. The aim of our current study was to evaluate the performance of the most recent version of OpenAI's large language model, ChatGPT-4o, in risk of bias assessment.

We conducted a systematic assessment of the performance of ChatGPT-4o in Cochrane's RoB 2.0 tool analyses. First, we searched PubMed on July 31, 2024, for the 50 most recent meta-analyses published in top-level medical journals (Lancet, JAMA, or BMJ). The results were uploaded to Covidence software for screening. Two authors (IK and OP) then screened the reviews and included intervention meta-analyses that included only RCTs and used the Cochrane RoB 2.0 tool for risk of bias assessment. A total of six reviews were included (Figure S1). A third author (MV) then extracted a total of 100 risk of bias assessments from these included reviews. A fourth author (LN) uploaded the 100 corresponding studies in PDF format to ChatGPT-4o with a standardized short prompt entered in the text field. The prompt was: “Perform a risk of bias analysis according to the Cochrane group RoB2 guidelines for the following article and perform the assessment for the main outcome of the trial. Report results only as high, some concerns, low, no information for domains 1–5 and an overall assessment.” The complete list of included RCTs, the extracted risk of bias assessments, and the ChatGPT-4o assessments are provided in the Supplementary Material.
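For readers who want to reproduce this step programmatically rather than through the ChatGPT web interface the authors used, the sketch below shows one way to send the same standardized prompt together with a trial's text via the OpenAI Python SDK. This is an illustrative assumption, not the authors' workflow: the local pypdf extraction step and the example file path are ours.

```python
# Illustrative sketch only: the authors used the ChatGPT-4o web interface,
# not the API. The pypdf extraction step and file path are assumptions
# made for this example.
from openai import OpenAI
from pypdf import PdfReader

PROMPT = (
    "Perform a risk of bias analysis according to the Cochrane group RoB2 "
    "guidelines for the following article and perform the assessment for the "
    "main outcome of the trial. Report results only as high, some concerns, "
    "low, no information for domains 1-5 and an overall assessment."
)

def assess_trial(pdf_path: str) -> str:
    # Extract the trial's text locally (the web UI accepts the PDF directly).
    reader = PdfReader(pdf_path)
    article_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{article_text}"}],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(assess_trial("trial_01.pdf"))
```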

Finally, a fifth author (MU) performed the interrater agreement analyses, comparing ChatGPT-4o's assessments with those extracted from the published reviews. We calculated weighted Fleiss’ kappa estimates with 95% confidence intervals (CIs). Statistical analyses were performed in R version 4.2.1. The protocol was preregistered with the Open Science Framework (DOI: 10.17605/OSF.IO/J67W4).
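The authors report weighted Fleiss’ kappa computed in R; their code is not shown here. As an illustrative analogue only, the sketch below computes a linearly weighted Cohen's kappa, the common two-rater counterpart, with a percentile-bootstrap 95% CI in Python. The toy ratings are hypothetical.

```python
# Illustrative analogue of the agreement analysis (the authors used R and
# weighted Fleiss' kappa): linearly weighted Cohen's kappa for two raters
# with a percentile-bootstrap 95% CI. Ratings are ordinal:
# 0 = low, 1 = some concerns, 2 = high. The toy data are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def weighted_kappa_ci(rater_a, rater_b, n_boot=2000, seed=0):
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    kappa = cohen_kappa_score(rater_a, rater_b, weights="linear", labels=[0, 1, 2])
    rng = np.random.default_rng(seed)
    boots = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(rater_a), len(rater_a))  # resample studies
        boots[i] = cohen_kappa_score(
            rater_a[idx], rater_b[idx], weights="linear", labels=[0, 1, 2]
        )
    # nanpercentile guards against degenerate resamples with a single label
    lo, hi = np.nanpercentile(boots, [2.5, 97.5])
    return kappa, (lo, hi)

human = [0, 1, 0, 2, 1, 0, 1, 2, 0, 1]  # hypothetical review-author ratings
model = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]  # hypothetical, more optimistic model
kappa, (lo, hi) = weighted_kappa_ci(human, model)
print(f"weighted kappa = {kappa:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```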

A total of 100 RCTs were included in the analysis. The weighted kappa for the overall risk of bias assessment of the primary outcome was 0.24 (95% CI 0.10 to 0.37). Domain-specific agreement was highest for bias arising from the randomization process (kappa = 0.31, 95% CI 0.10 to 0.50) and lowest for bias in selection of the reported result (kappa = –0.11, 95% CI –0.16 to –0.04). For bias due to missing outcome data the kappa was 0.12 (95% CI 0.01 to 0.22), for bias due to deviations from the intended interventions 0.06 (95% CI –0.12 to 0.23), and for bias in measurement of the outcome –0.03 (95% CI –0.06 to 0.00). When comparing the given ratings, ChatGPT-4o labeled only three studies as high risk of bias in a single domain and classified none of the studies as having an overall high risk of bias (Figure 1).

Our study found that ChatGPT-4o showed slight agreement in the overall assessment and in the randomization domain, whereas the other domains ranged from no agreement to poor agreement. Our results are in line with the previously published, smaller report of ChatGPT-4o performance in RoB 2.0 assessment [12]. Unlike that study, we did not perform the risk of bias assessments ourselves. Furthermore, we included eight times as many studies, which increased the validity of our results. Another notable difference was that we systematically used the same standardized prompt for all evaluations, which improved the consistency of the assessments.

The primary limitation of our study was the use of a relatively simple prompt, as the performance of large language models is influenced by prompt quality. Nevertheless, this reflects a realistic scenario in which users of ChatGPT-4o might utilize it for critical appraisal without extensive knowledge of prompt engineering or awareness of its potential weaknesses. Another limitation was that we initially intended to compare our findings with assessments from Cochrane reviews; however, most Cochrane reviews still employ the original RoB tool [13], whereas the RoB 2.0 tool is more commonly used in high-impact medical journals, so we changed the original plan and extracted the assessments from top journals [5]. Finally, interrater agreement was not reported in the included reviews, so direct comparisons between agreement levels could not be made.

Future studies should focus on whether the performance of ChatGPT-4o could be enhanced by better prompting or by providing guidance through worked examples of risk of bias scenarios. Another focus could be how alternative large language models perform relative to human assessment, and how different large language models compare with one another. Furthermore, an interesting idea to test would be whether humans and several large language models reaching a majority-vote consensus on risk of bias assessments would improve quality and agreement; a minimal sketch of such voting follows.
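As a toy illustration of the majority-vote idea (our own sketch, not a method evaluated in the study), the function below combines per-study ratings from several raters, whether human or LLM, and breaks ties toward the more conservative label:

```python
# Toy sketch of majority voting over risk of bias ratings (illustration only,
# not a method evaluated in the study). Ties are broken toward the more
# conservative (higher-risk) label.
from collections import Counter

SEVERITY = {"low": 0, "some concerns": 1, "high": 2}

def majority_vote(ratings: list[str]) -> str:
    counts = Counter(ratings)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    return max(tied, key=SEVERITY.get)

# Three raters (e.g., two humans and one LLM) on a single trial:
print(majority_vote(["low", "some concerns", "some concerns"]))  # -> some concerns
print(majority_vote(["low", "high", "some concerns"]))           # -> high (tie-break)
```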

We found that the performance of ChatGPT-4o in the risk of bias analyses was poor, mainly because its assessments were overly optimistic. The findings of this study highlight that ChatGPT-4o, used with simple prompts, is not suitable for risk of bias assessments.

The authors declare no conflicts of interest.
