Automated production of comparison tables for shared decision making: Comparing a human-generated table (Option Grid), a search engine process, and outputs from four large language models

IF 3.1 | Medicine, CAS Tier 2 | JCR Q2, Public, Environmental & Occupational Health
Padhraig Ryan, Glyn Elwyn
{"title":"自动生成用于共享决策制定的比较表:比较人工生成的表(Option Grid)、搜索引擎流程和来自四个大型语言模型的输出","authors":"Padhraig Ryan ,&nbsp;Glyn Elwyn","doi":"10.1016/j.pec.2025.109356","DOIUrl":null,"url":null,"abstract":"<div><h3>Objectives</h3><div>To explore the ability of artificial intelligence to produce comparison tables to facilitate shared decision-making.</div></div><div><h3>Methods</h3><div>An expert human-generated comparison table (Option Grid ™) was compared to four comparison tables produced by large language models and one produced using a Google search process that a patient might undertake. Each table was prepared for a patient with osteoarthritis of the knee, considering a knee replacement. The results were compared to the Option Grid.™</div><div>The information items in each comparison table were divided into eight categories: the intervention process; benefits; side effects &amp; adverse effects; pre-operative care; post-operative care &amp; physical recovery; repeat surgery; decision-making process; and alternative interventions. We assessed the accuracy of each information item in a binary manner (accurate, inaccurate).</div></div><div><h3>Results</h3><div>OpenBioLLM-70b and two proprietary ChatGPT models generated similar frequencies of information items across most categories, but omitted information on alternative interventions. The Google search process yielded the highest number of information items (n = 41), and OpenBioLLM-8b yielded the lowest (n = 20). Accuracy, compared to the human Option Grid, was 97 % for the ChatGPT models and the open-source OpenBioLLM-70b, and 95 % for OpenBioLLM-8b and the Google search process. The human-generated Option Grid had superior readability.</div></div><div><h3>Conclusions</h3><div>Large language models produced comparison tables that are 3–5 % less accurate than a human generated Option Grid. Comparison tables produced by large language models may be less readable and require additional checking and editing.</div></div><div><h3>Practice implications</h3><div>Subject to fact-checking and feedback, large language models may have a role to play in scaling up the production of evidence-based comparison tables that could assist patients and others.</div></div>","PeriodicalId":49714,"journal":{"name":"Patient Education and Counseling","volume":"142 ","pages":"Article 109356"},"PeriodicalIF":3.1000,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automated production of comparison tables for shared decision making: Comparing a human-generated table (Option Grid), a search engine process, and outputs from four large language models\",\"authors\":\"Padhraig Ryan ,&nbsp;Glyn Elwyn\",\"doi\":\"10.1016/j.pec.2025.109356\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objectives</h3><div>To explore the ability of artificial intelligence to produce comparison tables to facilitate shared decision-making.</div></div><div><h3>Methods</h3><div>An expert human-generated comparison table (Option Grid ™) was compared to four comparison tables produced by large language models and one produced using a Google search process that a patient might undertake. Each table was prepared for a patient with osteoarthritis of the knee, considering a knee replacement. 
The results were compared to the Option Grid.™</div><div>The information items in each comparison table were divided into eight categories: the intervention process; benefits; side effects &amp; adverse effects; pre-operative care; post-operative care &amp; physical recovery; repeat surgery; decision-making process; and alternative interventions. We assessed the accuracy of each information item in a binary manner (accurate, inaccurate).</div></div><div><h3>Results</h3><div>OpenBioLLM-70b and two proprietary ChatGPT models generated similar frequencies of information items across most categories, but omitted information on alternative interventions. The Google search process yielded the highest number of information items (n = 41), and OpenBioLLM-8b yielded the lowest (n = 20). Accuracy, compared to the human Option Grid, was 97 % for the ChatGPT models and the open-source OpenBioLLM-70b, and 95 % for OpenBioLLM-8b and the Google search process. The human-generated Option Grid had superior readability.</div></div><div><h3>Conclusions</h3><div>Large language models produced comparison tables that are 3–5 % less accurate than a human generated Option Grid. Comparison tables produced by large language models may be less readable and require additional checking and editing.</div></div><div><h3>Practice implications</h3><div>Subject to fact-checking and feedback, large language models may have a role to play in scaling up the production of evidence-based comparison tables that could assist patients and others.</div></div>\",\"PeriodicalId\":49714,\"journal\":{\"name\":\"Patient Education and Counseling\",\"volume\":\"142 \",\"pages\":\"Article 109356\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Patient Education and Counseling\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0738399125007232\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Patient Education and Counseling","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0738399125007232","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0

Abstract

Objectives

To explore the ability of artificial intelligence to produce comparison tables to facilitate shared decision-making.

Methods

An expert human-generated comparison table (Option Grid™) was compared to four comparison tables produced by large language models and one produced using a Google search process that a patient might undertake. Each table was prepared for a patient with osteoarthritis of the knee who was considering a knee replacement. The results were compared to the Option Grid™.
The information items in each comparison table were divided into eight categories: the intervention process; benefits; side effects & adverse effects; pre-operative care; post-operative care & physical recovery; repeat surgery; decision-making process; and alternative interventions. We assessed the accuracy of each information item in a binary manner (accurate, inaccurate).
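As a concrete illustration of this coding scheme, the minimal sketch below tallies binary accuracy ratings per source; the source names, ratings, and function are hypothetical placeholders, not the study's actual data or workflow.

```python
# Minimal sketch of the binary accuracy tally described in the Methods.
# Sources, categories, and ratings below are hypothetical placeholders.
from collections import Counter

# Each rated item: (source, category, accurate?)
ratings = [
    ("ChatGPT model A", "benefits", True),
    ("ChatGPT model A", "repeat surgery", True),
    ("Google search", "benefits", False),
    ("Google search", "pre-operative care", True),
]

def summarize(ratings):
    """Return per-source (item count, % of items rated accurate)."""
    totals, accurate = Counter(), Counter()
    for source, _category, is_accurate in ratings:
        totals[source] += 1
        accurate[source] += int(is_accurate)
    return {s: (totals[s], 100 * accurate[s] / totals[s]) for s in totals}

for source, (n_items, pct) in summarize(ratings).items():
    print(f"{source}: {n_items} items, {pct:.0f}% accurate")
```

A tally of this kind yields directly the per-source item counts (e.g., 41 for the Google search process, 20 for OpenBioLLM-8b) and accuracy percentages reported in the Results.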

Results

OpenBioLLM-70b and two proprietary ChatGPT models generated similar frequencies of information items across most categories, but omitted information on alternative interventions. The Google search process yielded the highest number of information items (n = 41), and OpenBioLLM-8b yielded the lowest (n = 20). Accuracy, compared to the human Option Grid, was 97 % for the ChatGPT models and the open-source OpenBioLLM-70b, and 95 % for OpenBioLLM-8b and the Google search process. The human-generated Option Grid had superior readability.

Conclusions

Large language models produced comparison tables that were 3–5 % less accurate than a human-generated Option Grid. Comparison tables produced by large language models may be less readable and require additional checking and editing.
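The abstract does not name the readability metric used, so the following is only a hedged sketch of how such a comparison could be run, assuming a Flesch Reading Ease style measure via the open-source textstat package; the table texts are placeholders.

```python
# Illustrative readability comparison; the abstract does not specify the
# metric the authors used, so Flesch Reading Ease is an assumption here.
# Requires: pip install textstat
import textstat

# Placeholder snippets standing in for the full comparison-table texts.
tables = {
    "Option Grid (human)": "Most people feel less pain after a knee replacement.",
    "LLM-generated table": "Total knee arthroplasty is a surgical intervention indicated for end-stage osteoarthritis.",
}

for name, text in tables.items():
    # Higher Flesch Reading Ease scores indicate easier-to-read text.
    print(f"{name}: Flesch Reading Ease = {textstat.flesch_reading_ease(text):.1f}")
```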

Practice implications

Subject to fact-checking and feedback, large language models may have a role to play in scaling up the production of evidence-based comparison tables that could assist patients and others.
Source journal
Patient Education and Counseling (Medicine: Public, Environmental & Occupational Health)
CiteScore: 5.60
Self-citation rate: 11.40%
Annual articles: 384
Review time: 46 days
About the journal: Patient Education and Counseling is an interdisciplinary, international journal for patient education and health promotion researchers, managers and clinicians. The journal seeks to explore and elucidate educational, counseling and communication models in health care. Its aim is to provide a forum for fundamental as well as applied research, to promote the study of organizational issues involved in the delivery of patient education, counseling and health promotion services, and to examine training models for improving communication between providers and patients.