Sai P. Selvaraj, Renata W. Yen, Rachel Forcino, Glyn Elwyn
{"title":"共享决策的观察者OPTION-5测量自动化:通过比较大型语言模型和人类评分来评估有效性","authors":"Sai P. Selvaraj , Renata W. Yen , Rachel Forcino , Glyn Elwyn","doi":"10.1016/j.pec.2025.109362","DOIUrl":null,"url":null,"abstract":"<div><h3>Objectives</h3><div>Observer-based measures of shared decision rely on human raters, it is resource-intensive, limiting routine assessment and improvement. Generative artificial intelligence could increase the speed and accuracy of observer-based evaluation while reducing the burden. This study aimed to assess the performance of large language models (LLMs) from Gemini, GPT, and LLaMA family of models in evaluating the extent of shared decision-making between clinicians and women considering surgery for early-stage breast cancer.</div></div><div><h3>Methods</h3><div>LLM-generated scores were compared with those of trained human raters from a randomized controlled trial using the 5-item Observer OPTION-5 measure. We analyzed 287 anonymized transcripts of breast cancer consultations. A series of prompts were tested across models, assessing correlations with human scores. We also evaluated the ability of LLMs to distinguish high versus low encounters and the impact of inter-rater agreement on performance.<span><span><sup>1</sup></span></span></div></div><div><h3>Results</h3><div>The scores for Observer OPTION-5 items generated by the GPT-4o and Gemini-1.5-Pro-002 correlated with human ratings (Pearson r ≈ 0.6, p-value<0.01), representing ≈ 75–80 % of the correlation observed between human raters themselves (r = 0.77). Providing detailed descriptions and examples improved the models’ performance. The results also confirm that the models could distinguish high- from low-scoring encounters, with an independent-samples t-test showing a large and significant separation between the two groups (t > 10, p < 0.01).</div></div><div><h3>Conclusions</h3><div>Based on the breast cancer surgery dataset we explored, LLMs can evaluate aspects of clinician-patient dialog using existing measures, providing the basis for the development and fine-tuning of prompts. Future work should focus on generalizability, larger datasets, and improving model performance.</div></div><div><h3>Practice implications</h3><div>The prospect of being able to automate the assessment of shared decision-making opens the door to rapid feedback as a means for reflective practice improvement.</div></div>","PeriodicalId":49714,"journal":{"name":"Patient Education and Counseling","volume":"142 ","pages":"Article 109362"},"PeriodicalIF":3.1000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automating the Observer OPTION-5 measure of shared decision making: Assessing validity by comparing large language models to human ratings\",\"authors\":\"Sai P. Selvaraj , Renata W. Yen , Rachel Forcino , Glyn Elwyn\",\"doi\":\"10.1016/j.pec.2025.109362\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objectives</h3><div>Observer-based measures of shared decision rely on human raters, it is resource-intensive, limiting routine assessment and improvement. Generative artificial intelligence could increase the speed and accuracy of observer-based evaluation while reducing the burden. 
This study aimed to assess the performance of large language models (LLMs) from Gemini, GPT, and LLaMA family of models in evaluating the extent of shared decision-making between clinicians and women considering surgery for early-stage breast cancer.</div></div><div><h3>Methods</h3><div>LLM-generated scores were compared with those of trained human raters from a randomized controlled trial using the 5-item Observer OPTION-5 measure. We analyzed 287 anonymized transcripts of breast cancer consultations. A series of prompts were tested across models, assessing correlations with human scores. We also evaluated the ability of LLMs to distinguish high versus low encounters and the impact of inter-rater agreement on performance.<span><span><sup>1</sup></span></span></div></div><div><h3>Results</h3><div>The scores for Observer OPTION-5 items generated by the GPT-4o and Gemini-1.5-Pro-002 correlated with human ratings (Pearson r ≈ 0.6, p-value<0.01), representing ≈ 75–80 % of the correlation observed between human raters themselves (r = 0.77). Providing detailed descriptions and examples improved the models’ performance. The results also confirm that the models could distinguish high- from low-scoring encounters, with an independent-samples t-test showing a large and significant separation between the two groups (t > 10, p < 0.01).</div></div><div><h3>Conclusions</h3><div>Based on the breast cancer surgery dataset we explored, LLMs can evaluate aspects of clinician-patient dialog using existing measures, providing the basis for the development and fine-tuning of prompts. Future work should focus on generalizability, larger datasets, and improving model performance.</div></div><div><h3>Practice implications</h3><div>The prospect of being able to automate the assessment of shared decision-making opens the door to rapid feedback as a means for reflective practice improvement.</div></div>\",\"PeriodicalId\":49714,\"journal\":{\"name\":\"Patient Education and Counseling\",\"volume\":\"142 \",\"pages\":\"Article 109362\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Patient Education and Counseling\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0738399125007293\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Patient Education and Counseling","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0738399125007293","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Automating the Observer OPTION-5 measure of shared decision making: Assessing validity by comparing large language models to human ratings
Objectives
Observer-based measures of shared decision making rely on trained human raters, a resource-intensive process that limits routine assessment and improvement. Generative artificial intelligence could increase the speed and accuracy of observer-based evaluation while reducing the burden on raters. This study aimed to assess the performance of large language models (LLMs) from the Gemini, GPT, and LLaMA families in evaluating the extent of shared decision making between clinicians and women considering surgery for early-stage breast cancer.
Methods
LLM-generated scores were compared with those of trained human raters from a randomized controlled trial, using the 5-item Observer OPTION-5 measure. We analyzed 287 anonymized transcripts of breast cancer consultations. A series of prompts was tested across models, and the resulting scores were assessed for correlation with human ratings. We also evaluated the ability of LLMs to distinguish high-scoring from low-scoring encounters and the effect of inter-rater agreement on performance.
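To make the scoring step concrete, the sketch below shows one way an LLM could be prompted to rate a single transcript on the five Observer OPTION-5 items (each scored 0–4). This is a minimal illustration under stated assumptions: the study's actual prompts, parsing logic, and exact item wording are not given in the abstract, and the function name score_transcript and the paraphrased item list are hypothetical.

from openai import OpenAI
import json

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paraphrased OPTION-5 item themes (illustrative only, not the instrument's exact wording)
OPTION5_ITEMS = [
    "1. Draws attention to the existence of alternative treatment options",
    "2. Supports the patient in becoming informed and deliberating about the options",
    "3. Gives information, or checks understanding, about the options",
    "4. Elicits the patient's preferences",
    "5. Integrates the patient's preferences into the decision",
]

def score_transcript(transcript: str) -> dict:
    """Ask the model to rate one consultation transcript on each item (0-4)."""
    prompt = (
        "Rate the clinician-patient consultation below on each Observer OPTION-5 "
        "item from 0 (no effort) to 4 (exemplary effort). Items:\n"
        + "\n".join(OPTION5_ITEMS)
        + "\nReply with a JSON object mapping item numbers to scores.\n\n"
        + transcript
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},  # keeps the reply machine-parseable
    )
    return json.loads(resp.choices[0].message.content)

Setting the temperature to 0 and requesting a JSON object is one plausible way to collect item-level scores consistently across hundreds of transcripts.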
Results
The Observer OPTION-5 item scores generated by GPT-4o and Gemini-1.5-Pro-002 correlated with human ratings (Pearson r ≈ 0.6, p < 0.01), representing approximately 75–80% of the correlation observed between the human raters themselves (r = 0.77). Providing detailed descriptions and examples improved the models' performance. The results also confirm that the models could distinguish high- from low-scoring encounters, with an independent-samples t-test showing a large and significant separation between the two groups (t > 10, p < 0.01).
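As a hedged illustration of the two statistics reported above, the following sketch computes a Pearson correlation between LLM and human scores and a Welch independent-samples t-test between high- and low-scoring encounter groups. The data are synthetic and the median-split grouping rule is an assumption; the abstract does not specify how high- and low-scoring encounters were defined.

import numpy as np
from scipy.stats import pearsonr, ttest_ind

# Synthetic stand-ins for 287 per-encounter Observer OPTION-5 total scores (0-20)
rng = np.random.default_rng(0)
human = rng.integers(0, 21, size=287).astype(float)
llm = human + rng.normal(0, 3, size=287)  # hypothetical model ratings

# Concurrent validity: correlation between LLM and human ratings
r, p = pearsonr(llm, human)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")

# Discrimination: separation of high- from low-scoring encounters,
# with groups defined here by a median split on the human ratings
median = np.median(human)
high = llm[human >= median]
low = llm[human < median]
t, p = ttest_ind(high, low, equal_var=False)  # Welch's t-test
print(f"t = {t:.1f}, p = {p:.3g}")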
Conclusions
Based on the breast cancer surgery dataset we explored, LLMs can evaluate aspects of clinician-patient dialog using existing measures, providing the basis for the development and fine-tuning of prompts. Future work should focus on generalizability, larger datasets, and improving model performance.
Practice implications
The prospect of being able to automate the assessment of shared decision-making opens the door to rapid feedback as a means for reflective practice improvement.
About the Journal
Patient Education and Counseling is an interdisciplinary, international journal for patient education and health promotion researchers, managers, and clinicians. The journal seeks to explore and elucidate educational, counseling, and communication models in health care. Its aim is to provide a forum for fundamental as well as applied research, and to promote the study of organizational issues involved in the delivery of patient education, counseling, and health promotion services, as well as training models for improving communication between providers and patients.