Can LLMs Predict Patient Treatment Choices? A Discrete Choice Experiment Framework

Tina Cheng, Juan Marcos Gonzalez, Matthew M Engelhard, Shelby Reed, Semra Ozdemir

Value in Health, published online May 5, 2026. DOI: 10.1016/j.jval.2026.04.006
Abstract
Objectives: This study evaluated the viability of large language models (LLMs), specifically GPT-4, in predicting patients' health-preference-consistent treatment choices using a discrete choice experiment (DCE) framework.
Methods: Synthetic data were generated from real DCE responses by patients with a history of cancer. The analytical dataset included 50 synthetic patients, each answering 48 two-alternative treatment-choice questions that varied in expected survival, chance of long-term survival, health limitations, and out-of-pocket cost. GPT-4's predictive performance was assessed across four experiments. In Experiments 1 and 2, GPT-4 predicted 20 hold-out questions (i.e., new choice questions) using 28 fixed (Experiment 1) or randomly selected (Experiment 2) sample questions. Experiment 3 varied the number of sample questions to examine prediction accuracy and prediction confidence. Experiment 4 evaluated how characteristics of the hold-out questions influenced prediction accuracy.
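To make the in-context setup concrete, the sketch below shows one way such a few-shot prompt could be constructed. This is a minimal illustration assuming the OpenAI Python SDK; the attribute wording, helper functions, and prompt text are hypothetical and are not the authors' actual materials.

```python
# Minimal sketch (not the authors' actual prompt): few-shot prediction of one
# patient's DCE choice with GPT-4 via the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative attribute layout; the study's exact wording and levels are not reproduced here.
def format_question(q):
    return (
        f"Treatment A: expected survival {q['a']['survival']} months, "
        f"{q['a']['long_term']}% chance of long-term survival, "
        f"health limitations: {q['a']['limits']}, out-of-pocket cost ${q['a']['cost']}.\n"
        f"Treatment B: expected survival {q['b']['survival']} months, "
        f"{q['b']['long_term']}% chance of long-term survival, "
        f"health limitations: {q['b']['limits']}, out-of-pocket cost ${q['b']['cost']}."
    )

def predict_choice(sample_questions, sample_answers, holdout_question):
    """Show the model a patient's previous choices, then ask it to predict one more."""
    examples = "\n\n".join(
        f"Question {i + 1}:\n{format_question(q)}\nPatient chose: {a}"
        for i, (q, a) in enumerate(zip(sample_questions, sample_answers))
    )
    prompt = (
        "Below are treatment choices previously made by one patient with a history "
        "of cancer. Based on the preferences these choices reveal, predict which "
        "treatment the patient would choose in the final question. Answer 'A' or 'B' only.\n\n"
        f"{examples}\n\nFinal question:\n{format_question(holdout_question)}\nPatient chooses:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic output for evaluation
    )
    return resp.choices[0].message.content.strip()
```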
Results: GPT-4 achieved an average prediction accuracy of 70.5% (95% confidence interval [CI]: 68.3%-72.7%) in Experiment 1 and 69.9% (95% CI: 66.9%-72.9%) in Experiment 2, with greater variability when sample questions were randomized. Experiment 3 revealed a learning curve, where accuracy improved from 53% with 5 sample questions to 64% with 10, after which performance plateaued. Experiment 4 showed higher prediction accuracy for questions with more salient attribute differences.
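The abstract does not state how the confidence intervals were derived; one plausible construction, assuming each of the 50 synthetic patients contributes a patient-level accuracy over the 20 hold-out questions, is a normal-approximation interval over those patient means:

```python
# Hypothetical illustration of a 95% CI over patient-level accuracies
# (the paper's exact CI method is not specified in the abstract).
import statistics

def accuracy_ci(per_patient_accuracies, z=1.96):
    """Mean accuracy and normal-approximation 95% CI across patients."""
    n = len(per_patient_accuracies)
    mean = statistics.mean(per_patient_accuracies)
    se = statistics.stdev(per_patient_accuracies) / n ** 0.5
    return mean, (mean - z * se, mean + z * se)

# e.g., 50 patients, each scored on 20 hold-out questions:
# mean, (lo, hi) = accuracy_ci([correct_i / 20 for correct_i in counts])
```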
Conclusions: GPT-4 demonstrated the ability to infer patient preferences from limited samples, achieving accuracy levels comparable to surrogate decision-makers. Its performance remained consistent across randomized input sequences and improved as the number of sample questions increased, eventually reaching a plateau where additional in-context examples yielded diminishing returns.
About the Journal
Value in Health contains original research articles on pharmacoeconomics, health economics, and outcomes research (clinical, economic, and patient-reported outcomes/preference-based research), as well as conceptual and health policy articles that provide valuable information for health care decision-makers and the research community. As the official journal of ISPOR, Value in Health provides a forum for researchers and health care decision-makers to translate outcomes research into health care decisions.