Can LLMs Predict Patient Treatment Choices? A Discrete Choice Experiment Framework

Tina Cheng, Juan Marcos Gonzalez, Matthew M Engelhard, Shelby Reed, Semra Ozdemir

Value in Health, published online May 5, 2026. DOI: 10.1016/j.jval.2026.04.006
Abstract
Objectives: This study evaluated the viability of large language models (LLMs), specifically GPT-4, in predicting patients' health-preference-consistent treatment choices using a discrete choice experiment (DCE) framework.
Methods: Synthetic data were generated from real DCE responses by patients with a history of cancer. The analytical dataset included 50 synthetic patients, each answering 48 two-alternative treatment-choice questions that varied in expected survival, chance of long-term survival, health limitations, and out-of-pocket cost. GPT-4's predictive performance was assessed across four experiments. In Experiments 1 and 2, GPT-4 predicted 20 hold-out questions (i.e., new choice questions) using 28 fixed (Experiment 1) or randomly selected (Experiment 2) sample questions. Experiment 3 varied the number of sample questions to examine prediction accuracy and prediction confidence. Experiment 4 evaluated how characteristics of the hold-out questions influenced prediction accuracy.
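To make the in-context setup concrete, the sketch below shows one way such a few-shot prompt could be constructed. This is a minimal illustration assuming the OpenAI Python SDK; the attribute wording, helper functions, and prompt text are hypothetical and are not the authors' actual materials.

```python
# Minimal sketch (not the authors' actual prompt): few-shot prediction of one
# patient's DCE choice with GPT-4 via the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative attribute layout; the study's exact wording and levels are not reproduced here.
def format_question(q):
    return (
        f"Treatment A: expected survival {q['a']['survival']} months, "
        f"{q['a']['long_term']}% chance of long-term survival, "
        f"health limitations: {q['a']['limits']}, out-of-pocket cost ${q['a']['cost']}.\n"
        f"Treatment B: expected survival {q['b']['survival']} months, "
        f"{q['b']['long_term']}% chance of long-term survival, "
        f"health limitations: {q['b']['limits']}, out-of-pocket cost ${q['b']['cost']}."
    )

def predict_choice(sample_questions, sample_answers, holdout_question):
    """Show the model a patient's previous choices, then ask it to predict one more."""
    examples = "\n\n".join(
        f"Question {i + 1}:\n{format_question(q)}\nPatient chose: {a}"
        for i, (q, a) in enumerate(zip(sample_questions, sample_answers))
    )
    prompt = (
        "Below are treatment choices previously made by one patient with a history "
        "of cancer. Based on the preferences these choices reveal, predict which "
        "treatment the patient would choose in the final question. Answer 'A' or 'B' only.\n\n"
        f"{examples}\n\nFinal question:\n{format_question(holdout_question)}\nPatient chooses:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic output for evaluation
    )
    return resp.choices[0].message.content.strip()
```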
Results: GPT-4 achieved an average prediction accuracy of 70.5% (95% confidence interval [CI]: 68.3%-72.7%) in Experiment 1 and 69.9% (95% CI: 66.9%-72.9%) in Experiment 2, with greater variability when sample questions were randomized. Experiment 3 revealed a learning curve, where accuracy improved from 53% with 5 sample questions to 64% with 10, after which performance plateaued. Experiment 4 showed higher prediction accuracy for questions with more salient attribute differences.
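The abstract does not state how the confidence intervals were derived; one plausible construction, assuming each of the 50 synthetic patients contributes a patient-level accuracy over the 20 hold-out questions, is a normal-approximation interval over those patient means:

```python
# Hypothetical illustration of a 95% CI over patient-level accuracies
# (the paper's exact CI method is not specified in the abstract).
import statistics

def accuracy_ci(per_patient_accuracies, z=1.96):
    """Mean accuracy and normal-approximation 95% CI across patients."""
    n = len(per_patient_accuracies)
    mean = statistics.mean(per_patient_accuracies)
    se = statistics.stdev(per_patient_accuracies) / n ** 0.5
    return mean, (mean - z * se, mean + z * se)

# e.g., 50 patients, each scored on 20 hold-out questions:
# mean, (lo, hi) = accuracy_ci([correct_i / 20 for correct_i in counts])
```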
Conclusions: GPT-4 demonstrated the ability to infer patient preferences from limited samples, achieving accuracy levels comparable to surrogate decision-makers. Its performance remained consistent across randomized input sequences and improved as the number of sample questions increased, eventually reaching a plateau where additional in-context examples yielded diminishing returns.
About the Journal
Value in Health contains original research articles on pharmacoeconomics, health economics, and outcomes research (clinical, economic, and patient-reported outcomes/preference-based research), as well as conceptual and health policy articles that provide valuable information for health care decision-makers and the research community. As the official journal of ISPOR, Value in Health provides a forum for researchers and health care decision-makers to translate outcomes research into health care decisions.