ChatGPT and low back pain - Evaluating AI-driven patient education in the context of interventional pain medicine

Ahmed Basharat, Rohan Shah, Nick Wilcox, Gurpaij Tur, Siddarth Tripati, Prisha Kansal, Niveah Gandhi, Sreekrishna Pokuri, Gabby Chong, Charles A. Odonkor, Narayana Varhabhatla, Robert Chow

Interventional Pain Medicine, Volume 4, Issue 3, Article 100636 (September 2025). DOI: 10.1016/j.inpm.2025.100636
Abstract
Background
ChatGPT and other Large Language Models (LLMs) are not only being integrated more readily into healthcare but are also being used more frequently by patients to answer health-related questions. Given this increased utilization, it is essential to evaluate the consistency and reliability of artificial intelligence (AI) responses. Low back pain (LBP) remains one of the most frequently seen chief complaints in primary care and interventional pain management offices.
Objective
This study assesses the readability, accuracy, and overall utility of ChatGPT's responses to patients' questions concerning low back pain. Our aim is to use clinician feedback to analyze ChatGPT's responses to these common low back pain-related questions, as AI will undoubtedly play a role in triaging patients before they see a physician.
Methods
To assess AI responses, we generated a standardized list of 25 questions concerning low back pain, split into five categories: diagnosis, seeking a medical professional, treatment, self-treatment, and physical therapy. We explored the influence of prompt wording on ChatGPT by phrasing questions at reading levels ranging from 4th grade to college/reference level. One board-certified interventional pain specialist, one interventional pain fellow, and one emergency medicine resident reviewed ChatGPT's generated answers to assess accuracy and clinical utility. Readability and comprehensibility were evaluated using the Flesch-Kincaid Grade Level Scale. Statistical analysis was performed to compare readability scores, word count, and response complexity across prompt types.
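The abstract does not specify how the Flesch-Kincaid scoring was implemented in practice. Purely as an illustration, the minimal Python sketch below applies the standard Flesch-Kincaid Grade Level formula (0.39 × words per sentence + 11.8 × syllables per word − 15.59) to a sample answer, using a simple vowel-group heuristic for syllable counting; the heuristic and the example text are assumptions, not study materials.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic for English syllable counting."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # discount a common silent final "e"
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

# Hypothetical ChatGPT-style answer about low back pain (illustrative only)
answer = ("Most episodes of acute low back pain improve within a few weeks "
          "with activity modification, heat, and over-the-counter analgesics.")
print(round(flesch_kincaid_grade(answer), 1))
```

Published readability tools use more careful syllable counting, so scores from a sketch like this would differ slightly from validated implementations.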
Results
How a question is phrased influences accuracy in statistically significant ways. Over-simplification of queries (e.g., to a 4th-grade level) degrades ChatGPT's ability to return clinically complete responses. In contrast, reference-level and neutral queries preserve accuracy without additional prompt engineering. Regardless of how the question is phrased, ChatGPT's default register trends toward technical language, and readability remains substantially misaligned with health literacy standards. Verbosity correlates with prompt type but not necessarily with accuracy: word count is an unreliable proxy for informational completeness or clinical correctness in AI outputs, and most errors stem from omission rather than commission. Importantly, ChatGPT does not frequently generate false claims.
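The abstract does not state which statistical test was used to compare responses across prompt phrasings. Purely as an illustration of how such a comparison could be run, the sketch below applies a one-way ANOVA (scipy.stats.f_oneway) to hypothetical Flesch-Kincaid grade-level scores grouped by prompt level; the group names and numbers are invented for demonstration and are not study data.

```python
from scipy.stats import f_oneway

# Hypothetical grade-level scores for the same questions asked at three
# prompt levels (illustrative values only, not data from this study).
fourth_grade_prompts = [9.8, 10.2, 11.0, 9.5, 10.7]
neutral_prompts      = [12.1, 11.8, 12.6, 13.0, 12.4]
reference_prompts    = [14.2, 13.9, 15.1, 14.8, 14.5]

f_stat, p_value = f_oneway(fourth_grade_prompts, neutral_prompts, reference_prompts)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```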
Conclusion
This analysis complicates the assumption that "simpler is better" when prompting LLMs for clinical education. Whereas earlier work in structured conditions suggested that plain-language prompts improved accuracy, our findings indicate that a moderate reading level, not maximal simplicity, yields the most reliable outputs in complex domains such as pain. This study further supports the integration of LLMs into clinical workflows, possibly through electronic health record (EHR) software.