Using Large Language Models to Address Contextual Questions in Systematic Reviews

Susanne Hempel, Kimny Sysawang, Haley K. Holmer, Erin Tokutomi, Suchitra Iyer, Zhen Wang, Edi Kuhn, Mohammad Hassan Murad
Cochrane Evidence Synthesis and Methods, Volume 4, Issue 2. DOI: 10.1002/cesm.70060. Published 2026-02-26.

Abstract

Objectives

Systematic evidence reviews (SERs) produced by the U.S. Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Center (EPC) Program use contextual questions to provide context and background information on the topic. There is currently no standardized approach to addressing contextual questions in systematic reviews. This study explored the use of publicly available large language models (LLMs) to address contextual questions.

Study Design

Using a set of 20 published and 5 yet-to-be-published SERs, we selected one contextual question per report and used it as a prompt to elicit answers from an LLM (ChatGPT, Bard, Claude, or Perplexity). Two independent reviewers rated the results using evaluation criteria established a priori (https://osf.io/4k3cu/), comparing the response in the SER to the LLM-generated responses. The study was guided by six research questions addressing feasibility, validity of content, validity of structure, mistakes, congruence between responses, and the incremental validity of using LLMs to address contextual questions.

Results

Minimal prompt engineering produced relevant responses and documented the feasibility of LLM-generated answers to contextual questions. Responses differed in content and format and were not reproducible (e.g., because LLMs are updated regularly), but LLMs were able to produce articulate, clinically plausible, and well-structured responses. We detected few factual errors or contradictions and no instances of suspected bias, but citations supporting LLM-generated responses often could not be produced or could not be verified ('confabulations'). Congruence with human-generated responses varied: LLM-generated responses provided more background on the topic, while SERs provided more nuanced answers to the contextual question. Results regarding incremental validity were mixed and may depend on the tool.

Conclusion

LLMs are potentially helpful for addressing contextual questions in systematic reviews, but human expertise remains essential for using the generated information in a meaningful way.
