ChatGPT and low back pain - Evaluating AI-driven patient education in the context of interventional pain medicine

Ahmed Basharat, Rohan Shah, Nick Wilcox, Gurpaij Tur, Siddarth Tripati, Prisha Kansal, Niveah Gandhi, Sreekrishna Pokuri, Gabby Chong, Charles A. Odonkor, Narayana Varhabhatla, Robert Chow
{"title":"ChatGPT and low back pain - Evaluating AI-driven patient education in the context of interventional pain medicine","authors":"Ahmed Basharat ,&nbsp;Rohan Shah ,&nbsp;Nick Wilcox ,&nbsp;Gurpaij Tur ,&nbsp;Siddarth Tripati ,&nbsp;Prisha Kansal ,&nbsp;Niveah Gandhi ,&nbsp;Sreekrishna Pokuri ,&nbsp;Gabby Chong ,&nbsp;Charles A. Odonkor ,&nbsp;Narayana Varhabhatla ,&nbsp;Robert Chow","doi":"10.1016/j.inpm.2025.100636","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>ChatGPT and other Large Language Models (LLMs) are not only being more readily integrated into healthcare but are also being utilized more frequently by patients to answer health-related questions. Given the increased utilization for this purpose, it is essential to evaluate and study the consistency and reliability of artificial intelligence (AI) responses. Low back pain (LBP) remains one of the most frequently seen chief complaints in primary care and interventional pain management offices.</div></div><div><h3>Objective</h3><div>This study assesses the readability, accuracy, and overall utility of ChatGPT's ability to address patients' questions concerning low back pain. Our aim is to use clinician feedback to analyze ChatGPT's responses to these common low back pain related questions, as in the future, AI will undoubtedly play a role in triaging patients prior to seeing a physician.</div></div><div><h3>Methods</h3><div>To assess AI responses, we generated a standardized list of 25 questions concerning low back pain that were split into five categories including diagnosis, seeking a medical professional, treatment, self-treatment, and physical therapy. We explored the influence of how a prompt is worded on ChatGPT by asking questions from a 4th grader to a college/reference level. One board certified interventional pain specialist, one interventional pain fellow, and one emergency medicine resident reviewed ChatGPT's generated answers to assess accuracy and clinical utility. Readability and comprehensibility were evaluated using the Flesch-Kincaid Grade Level Scale. Statistical analysis was performed to analyze differences in readability scores, word count, and response complexity.</div></div><div><h3>Results</h3><div>How a question is phrased influences accuracy in statistically significant ways. Over-simplification of queries (e.g. to a 4th grade level) degrades ChatGPT's ability to return clinically complete responses. In contrast, reference and neutral queries preserve accuracy without additional engineering. Regardless of how the question is phrased, ChatGPT's default register trends towards technical language. Readability remains substantially misaligned with health literacy standards. Verbosity correlates with prompt type, but not necessarily accuracy. Word count is an unreliable proxy for informational completeness or clinical correctness in AI outputs and most errors stem from omission, not commission. Importantly, ChatGPT does not frequently generate false claims.</div></div><div><h3>Conclusion</h3><div>This analysis complicates the assumption that “simpler is better” in prompting LLMs for clinical education. Whereas earlier work in structured conditions suggested that plain-language prompts improved accuracy, our findings indicate that a moderate reading level, not maximal simplicity, yields the most reliable outputs in complex domains like pain. 
This study further supports that AI LLMs can be integrated into a clinical workflow, possibly through electronic health record (EHR) software.</div></div>","PeriodicalId":100727,"journal":{"name":"Interventional Pain Medicine","volume":"4 3","pages":"Article 100636"},"PeriodicalIF":0.0000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interventional Pain Medicine","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772594425000974","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background

ChatGPT and other large language models (LLMs) are not only being integrated into healthcare more readily but are also being used more frequently by patients to answer health-related questions. Given this increased utilization, it is essential to evaluate the consistency and reliability of artificial intelligence (AI) responses. Low back pain (LBP) remains one of the most common chief complaints in primary care and interventional pain management practices.

Objective

This study assesses the readability, accuracy, and overall utility of ChatGPT's responses to patients' questions concerning low back pain. Our aim is to use clinician feedback to analyze ChatGPT's answers to these common low back pain-related questions, as AI will undoubtedly play a role in triaging patients before they see a physician.

Methods

To assess AI responses, we generated a standardized list of 25 questions concerning low back pain, split into five categories: diagnosis, seeking a medical professional, treatment, self-treatment, and physical therapy. We explored the influence of prompt wording on ChatGPT by phrasing each question at reading levels ranging from 4th grade to a college/reference level. One board-certified interventional pain specialist, one interventional pain fellow, and one emergency medicine resident reviewed ChatGPT's generated answers to assess accuracy and clinical utility. Readability and comprehensibility were evaluated using the Flesch-Kincaid Grade Level scale. Statistical analysis was performed to analyze differences in readability scores, word count, and response complexity.
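For illustration, the readability metric referenced above can be approximated in a few lines of code. The following minimal sketch is not the authors' analysis pipeline: it applies the standard Flesch-Kincaid Grade Level formula in Python, the syllable counter is a rough vowel-group heuristic, and the example answers are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; dedicated readability tools use
    dictionaries or more elaborate rules, so treat this as an approximation."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # drop a likely silent trailing 'e'
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)


# Hypothetical answers about low back pain written at two different registers.
simple_answer = "Most back pain gets better on its own. Stay active and use heat."
reference_answer = ("Acute lumbar pain is frequently self-limited; "
                    "guideline-concordant management emphasizes activity "
                    "preservation, superficial heat, and nonsteroidal "
                    "anti-inflammatory drugs before advanced imaging.")

print(round(flesch_kincaid_grade(simple_answer), 1))     # low grade level
print(round(flesch_kincaid_grade(reference_answer), 1))  # substantially higher
```

In the study itself, scores of this kind were compared across prompt reading levels; the sketch only illustrates why a dense, reference-style answer scores well above typical health literacy targets.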

Results

How a question is phrased influences accuracy in statistically significant ways. Over-simplification of queries (e.g., to a 4th-grade level) degrades ChatGPT's ability to return clinically complete responses. In contrast, reference-level and neutral queries preserve accuracy without additional prompt engineering. Regardless of how the question is phrased, ChatGPT's default register trends toward technical language, and readability remains substantially misaligned with health literacy standards. Verbosity correlates with prompt type, but not necessarily with accuracy: word count is an unreliable proxy for informational completeness or clinical correctness in AI outputs, and most errors stem from omission, not commission. Importantly, ChatGPT does not frequently generate false claims.

Conclusion

This analysis complicates the assumption that “simpler is better” when prompting LLMs for clinical education. Whereas earlier work in structured conditions suggested that plain-language prompts improved accuracy, our findings indicate that a moderate reading level, not maximal simplicity, yields the most reliable outputs in complex domains like pain. This study further supports the integration of LLMs into clinical workflows, possibly through electronic health record (EHR) software.