{"title":"Zero- and few-shot prompting of generative large language models provides weak assessment of risk of bias in clinical trials","authors":"Simon Šuster, Timothy Baldwin, Karin Verspoor","doi":"10.1002/jrsm.1749","DOIUrl":null,"url":null,"abstract":"<p>Existing systems for automating the assessment of risk-of-bias (RoB) in medical studies are supervised approaches that require substantial training data to work well. However, recent revisions to RoB guidelines have resulted in a scarcity of available training data. In this study, we investigate the effectiveness of generative large language models (LLMs) for assessing RoB. Their application requires little or no training data and, if successful, could serve as a valuable tool to assist human experts during the construction of systematic reviews. Following Cochrane's latest guidelines (RoB2) designed for human reviewers, we prepare instructions that are fed as input to LLMs, which then infer the risk associated with a trial publication. We distinguish between two modelling tasks: directly predicting RoB2 from text; and employing decomposition, in which a RoB2 decision is made after the LLM responds to a series of signalling questions. We curate new testing data sets and evaluate the performance of four general- and medical-domain LLMs. The results fall short of expectations, with LLMs seldom surpassing trivial baselines. On the direct RoB2 prediction test set (<i>n</i> = 5993), LLMs perform akin to the baselines (F1: 0.1–0.2). In the decomposition task setup (<i>n</i> = 28,150), similar F1 scores are observed. Our additional comparative evaluation on RoB1 data also reveals results substantially below those of a supervised system. This testifies to the difficulty of solving this task based on (complex) instructions alone. Using LLMs as an assisting technology for assessing RoB2 thus currently seems beyond their reach.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"15 6","pages":"988-1000"},"PeriodicalIF":5.0000,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/jrsm.1749","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Synthesis Methods","FirstCategoryId":"99","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/jrsm.1749","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
Abstract
Existing systems for automating the assessment of risk of bias (RoB) in medical studies are supervised approaches that require substantial training data to work well. However, recent revisions to the RoB guidelines have resulted in a scarcity of available training data. In this study, we investigate the effectiveness of generative large language models (LLMs) for assessing RoB. They require little or no training data and, if successful, could serve as a valuable tool to assist human experts in constructing systematic reviews. Following Cochrane's latest guidelines (RoB2), designed for human reviewers, we prepare instructions that are fed as input to LLMs, which then infer the risk associated with a trial publication. We distinguish between two modelling tasks: directly predicting RoB2 from text, and employing decomposition, in which a RoB2 decision is made after the LLM responds to a series of signalling questions. We curate new test data sets and evaluate the performance of four general- and medical-domain LLMs. The results fall short of expectations, with the LLMs seldom surpassing trivial baselines. On the direct RoB2 prediction test set (n = 5993), the LLMs perform on par with the baselines (F1: 0.1–0.2). In the decomposition setup (n = 28,150), similar F1 scores are observed. An additional comparative evaluation on RoB1 data also yields results substantially below those of a supervised system. This testifies to the difficulty of solving the task from (complex) instructions alone; serving as an assistive technology for RoB2 assessment thus currently appears to be beyond the reach of LLMs.
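To make the two task formulations concrete, below is a minimal sketch of how the prompts might be constructed, assuming a generic text-generation API. The `generate` placeholder, function names, and prompt wording are hypothetical and not taken from the paper; only the RoB2 judgement labels and the signalling-question answer options reflect Cochrane's published RoB2 guidance, and the rules that map signalling answers to a domain judgement are omitted here.

```python
# A minimal sketch of the two prompting setups described in the abstract.
# `generate` stands in for any LLM completion API and is hypothetical;
# the exact prompts used in the paper are not reproduced here.

ROB2_LABELS = ["low risk", "some concerns", "high risk"]


def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a chat-completion API)."""
    raise NotImplementedError


def direct_rob2(trial_text: str) -> str:
    # Task 1: ask the model for an overall RoB2 judgement in a single step.
    prompt = (
        "You are assessing risk of bias (RoB2) in a randomised trial.\n"
        f"Trial report:\n{trial_text}\n\n"
        f"Answer with exactly one of: {', '.join(ROB2_LABELS)}."
    )
    return generate(prompt).strip().lower()


def decomposed_rob2(trial_text: str, signalling_questions: list[str]) -> list[str]:
    # Task 2: elicit answers to the RoB2 signalling questions one at a time;
    # a domain judgement is then derived from these answers (the mapping
    # rules are specified in the RoB2 guidance and omitted here).
    answers = []
    for question in signalling_questions:
        prompt = (
            f"Trial report:\n{trial_text}\n\n"
            f"Signalling question: {question}\n"
            "Answer with one of: yes, probably yes, probably no, no, no information."
        )
        answers.append(generate(prompt).strip().lower())
    return answers
```

The decomposition setup mirrors how human reviewers apply RoB2: each signalling question is answered separately, and the overall judgement is derived afterwards rather than predicted in one step.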
About the Journal
Research Synthesis Methods is a reputable, peer-reviewed journal that focuses on the development and dissemination of methods for conducting systematic research synthesis. Our aim is to advance the knowledge and application of research synthesis methods across various disciplines.
Our journal provides a platform for the exchange of ideas and knowledge related to designing, conducting, analyzing, interpreting, reporting, and applying research synthesis. While research synthesis is commonly practiced in the health and social sciences, our journal also welcomes contributions from other fields to enrich the methodologies employed in research synthesis across scientific disciplines.
By bridging different disciplines, we aim to foster collaboration and cross-fertilization of ideas, ultimately enhancing the quality and effectiveness of research synthesis methods. Whether you are a researcher, practitioner, or stakeholder involved in research synthesis, our journal strives to offer valuable insights and practical guidance for your work.