Large Language Models to Help Appeal Denied Radiotherapy Services.

IF 3.3 Q2 ONCOLOGY
Kendall J Kiser, Michael Waters, Jocelyn Reckford, Christopher Lundeberg, Christopher D Abraham
{"title":"大语言模型帮助上诉被拒绝的放疗服务。","authors":"Kendall J Kiser, Michael Waters, Jocelyn Reckford, Christopher Lundeberg, Christopher D Abraham","doi":"10.1200/CCI.24.00129","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language model (LLM) artificial intelligences may help physicians appeal insurer denials of prescribed medical services, a task that delays patient care and contributes to burnout. We evaluated LLM performance at this task for denials of radiotherapy services.</p><p><strong>Methods: </strong>We evaluated generative pretrained transformer 3.5 (GPT-3.5; OpenAI, San Francisco, CA), GPT-4, GPT-4 with internet search functionality (GPT-4web), and GPT-3.5ft. The latter was developed by fine-tuning GPT-3.5 via an OpenAI application programming interface with 53 examples of appeal letters written by radiation oncologists. Twenty test prompts with simulated patient histories were programmatically presented to the LLMs, and output appeal letters were scored by three blinded radiation oncologists for language representation, clinical detail inclusion, clinical reasoning validity, literature citations, and overall readiness for insurer submission.</p><p><strong>Results: </strong>Interobserver agreement between radiation oncologists' scores was moderate or better for all domains (Cohen's kappa coefficients: 0.41-0.91). GPT-3.5, GPT-4, and GPT-4web wrote letters that were on average linguistically clear, summarized provided clinical histories without confabulation, reasoned appropriately, and were scored useful to expedite the insurance appeal process. GPT-4 and GPT-4web letters demonstrated superior clinical reasoning and were readier for submission than GPT-3.5 letters (<i>P</i> < .001). Fine-tuning increased GPT-3.5ft confabulation and compromised performance compared with other LLMs across all domains (<i>P</i> < .001). All LLMs, including GPT-4web, were poor at supporting clinical assertions with existing, relevant, and appropriately cited primary literature.</p><p><strong>Conclusion: </strong>When prompted appropriately, three commercially available LLMs drafted letters that physicians deemed would expedite appealing insurer denials of radiotherapy services. LLMs may decrease this task's clerical workload on providers. However, LLM performance worsened when fine-tuned with a task-specific, small training data set.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":3.3000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large Language Models to Help Appeal Denied Radiotherapy Services.\",\"authors\":\"Kendall J Kiser, Michael Waters, Jocelyn Reckford, Christopher Lundeberg, Christopher D Abraham\",\"doi\":\"10.1200/CCI.24.00129\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Large language model (LLM) artificial intelligences may help physicians appeal insurer denials of prescribed medical services, a task that delays patient care and contributes to burnout. We evaluated LLM performance at this task for denials of radiotherapy services.</p><p><strong>Methods: </strong>We evaluated generative pretrained transformer 3.5 (GPT-3.5; OpenAI, San Francisco, CA), GPT-4, GPT-4 with internet search functionality (GPT-4web), and GPT-3.5ft. 
The latter was developed by fine-tuning GPT-3.5 via an OpenAI application programming interface with 53 examples of appeal letters written by radiation oncologists. Twenty test prompts with simulated patient histories were programmatically presented to the LLMs, and output appeal letters were scored by three blinded radiation oncologists for language representation, clinical detail inclusion, clinical reasoning validity, literature citations, and overall readiness for insurer submission.</p><p><strong>Results: </strong>Interobserver agreement between radiation oncologists' scores was moderate or better for all domains (Cohen's kappa coefficients: 0.41-0.91). GPT-3.5, GPT-4, and GPT-4web wrote letters that were on average linguistically clear, summarized provided clinical histories without confabulation, reasoned appropriately, and were scored useful to expedite the insurance appeal process. GPT-4 and GPT-4web letters demonstrated superior clinical reasoning and were readier for submission than GPT-3.5 letters (<i>P</i> < .001). Fine-tuning increased GPT-3.5ft confabulation and compromised performance compared with other LLMs across all domains (<i>P</i> < .001). All LLMs, including GPT-4web, were poor at supporting clinical assertions with existing, relevant, and appropriately cited primary literature.</p><p><strong>Conclusion: </strong>When prompted appropriately, three commercially available LLMs drafted letters that physicians deemed would expedite appealing insurer denials of radiotherapy services. LLMs may decrease this task's clerical workload on providers. However, LLM performance worsened when fine-tuned with a task-specific, small training data set.</p>\",\"PeriodicalId\":51626,\"journal\":{\"name\":\"JCO Clinical Cancer Informatics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2024-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JCO Clinical Cancer Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1200/CCI.24.00129\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ONCOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JCO Clinical Cancer Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1200/CCI.24.00129","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ONCOLOGY","Score":null,"Total":0}
Citations: 0

Abstract


Purpose: Large language model (LLM) artificial intelligences may help physicians appeal insurer denials of prescribed medical services, a task that delays patient care and contributes to burnout. We evaluated LLM performance at this task for denials of radiotherapy services.

Methods: We evaluated generative pretrained transformer 3.5 (GPT-3.5; OpenAI, San Francisco, CA), GPT-4, GPT-4 with internet search functionality (GPT-4web), and GPT-3.5ft. The latter was developed by fine-tuning GPT-3.5 via an OpenAI application programming interface with 53 examples of appeal letters written by radiation oncologists. Twenty test prompts with simulated patient histories were programmatically presented to the LLMs, and output appeal letters were scored by three blinded radiation oncologists for language representation, clinical detail inclusion, clinical reasoning validity, literature citations, and overall readiness for insurer submission.
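
The abstract does not include the authors' code. As a rough illustration of the workflow it describes (fine-tuning GPT-3.5 through an OpenAI application programming interface on example appeal letters, then programmatically prompting the models with simulated patient histories), the following is a minimal sketch assuming the current openai Python SDK; the file name, model identifiers, prompt text, and helper function are placeholders, not the study's actual materials.

```python
# Sketch only: fine-tune GPT-3.5 on appeal-letter examples, then draft a letter from a test prompt.
# Assumes the openai Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# 1) Upload a JSONL file of training examples. Each line would pair a simulated patient
#    history (user message) with a physician-written appeal letter (assistant message).
training_file = client.files.create(
    file=open("appeal_letter_examples.jsonl", "rb"),  # placeholder path
    purpose="fine-tune",
)

# 2) Launch a fine-tuning job against a GPT-3.5 base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print("fine-tuning job:", job.id)

# 3) Once the job completes, prompt the resulting model (or a base GPT-3.5/GPT-4 model)
#    with a simulated patient history to draft an appeal letter.
def draft_appeal_letter(model_id: str, patient_history: str) -> str:
    response = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system",
             "content": "You are a radiation oncologist writing a letter appealing an "
                        "insurer's denial of a prescribed radiotherapy service."},
            {"role": "user", "content": patient_history},
        ],
    )
    return response.choices[0].message.content

letter = draft_appeal_letter("gpt-4", "Simulated history: T2N0M0 larynx cancer, IMRT denied.")
print(letter)
```

In the chat fine-tuning format, each JSONL line holds a `{"messages": [...]}` object with system, user, and assistant turns; the fine-tuned model ID returned by the completed job is what would stand in for the model argument to reproduce the GPT-3.5ft arm.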

Results: Interobserver agreement between radiation oncologists' scores was moderate or better for all domains (Cohen's kappa coefficients: 0.41-0.91). GPT-3.5, GPT-4, and GPT-4web wrote letters that were on average linguistically clear, summarized provided clinical histories without confabulation, reasoned appropriately, and were scored useful to expedite the insurance appeal process. GPT-4 and GPT-4web letters demonstrated superior clinical reasoning and were readier for submission than GPT-3.5 letters (P < .001). Fine-tuning increased GPT-3.5ft confabulation and compromised performance compared with other LLMs across all domains (P < .001). All LLMs, including GPT-4web, were poor at supporting clinical assertions with existing, relevant, and appropriately cited primary literature.
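
Cohen's kappa is defined for two raters, so with three blinded reviewers the reported 0.41-0.91 range presumably reflects pairwise agreement per scoring domain; the abstract does not specify the exact computation. Below is a minimal sketch of pairwise kappa with scikit-learn, using made-up ordinal ratings rather than the study's data.

```python
# Sketch only: pairwise Cohen's kappa between three reviewers for one scoring domain.
# The scores below are fabricated for illustration.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

scores = {
    "reviewer_a": [4, 5, 3, 4, 5, 2, 4, 3],
    "reviewer_b": [4, 4, 3, 4, 5, 2, 3, 3],
    "reviewer_c": [5, 4, 3, 4, 4, 2, 4, 3],
}

for r1, r2 in combinations(scores, 2):
    # weights="quadratic" treats the ratings as ordinal; omit for unweighted kappa.
    kappa = cohen_kappa_score(scores[r1], scores[r2], weights="quadratic")
    print(f"{r1} vs {r2}: kappa = {kappa:.2f}")
```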

Conclusion: When prompted appropriately, three commercially available LLMs drafted letters that physicians deemed would expedite appealing insurer denials of radiotherapy services. LLMs may decrease this task's clerical workload on providers. However, LLM performance worsened when fine-tuned with a task-specific, small training data set.

Source journal: JCO Clinical Cancer Informatics
CiteScore: 6.20
Self-citation rate: 4.80%
Articles published: 190