Referential hallucination and clinical reliability in large language models: a comparative analysis using regenerative medicine guidelines for chronic pain.
{"title":"Referential hallucination and clinical reliability in large language models: a comparative analysis using regenerative medicine guidelines for chronic pain.","authors":"Ozlem Kuculmez, Ahmet Usen, Emine Dündar Ahi","doi":"10.1007/s00296-025-05996-z","DOIUrl":null,"url":null,"abstract":"<p><p>This study compared language models' responses to open-ended questions on regenerative therapy guidelines for chronic pain, assessing their accuracy, reliability, usefulness, readability, semantic similarity, and hallucination rates. This cross-sectional study used 16 open-ended questions based on the American Society of Pain and Neuroscience's regenerative therapy guidelines for chronic pain. Questions were answered by ChatGPT-4o, Gemini 2.5 Flash, and Claude 4 Opus. Responses were rated on a 7-point Likert scale for usability and reliability, and a 5-point scale for accuracy. Hallucinogenicity, readability (FKRE, FKGL), and similarity (USE, ROUGE-L) were also assessed. Statistical comparisons were made, with significance set at p < 0.05. Claude Opus 4 showed the highest reliability (5.19 ± 1.11), usefulness (5.06 ± 1.0), and clinical accuracy (4.06 ± 0.68), outperforming ChatGPT-4o (4.13 ± 0.96; 3.94 ± 0.85; 3.38 ± 0.72) and Gemini 2.5 (4.19 ± 0.98; 4.06 ± 0.93; 3.38 ± 0.62). Claude had the lowest reference hallucinations (RHS 4.44 ± 3.18) vs. ChatGPT-4o (8.38 ± 1.86) and Gemini 2.5 (8.75 ± 1.73). In semantic similarity, Claude (0.68 ± 0.08) and Gemini (0.65 ± 0.07) surpassed ChatGPT-4o (0.60 ± 0.09). Gemini led in ROUGE-L F1 (0.12 ± 0.03) vs. Claude (0.10 ± 0.02) and ChatGPT-4o (0.07 ± 0.03). Readability was similar, though Gemini had a higher FKGL (11.3 ± 1.06) than Claude (10.3 ± 2.09). Claude Opus 4 showed superior accuracy, reliability, and usefulness, with significantly fewer hallucinations. Readability scores were similar across models. 
Further research is recommended.</p>","PeriodicalId":21322,"journal":{"name":"Rheumatology International","volume":"45 10","pages":"240"},"PeriodicalIF":2.9000,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Rheumatology International","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00296-025-05996-z","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"RHEUMATOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
This study compared large language models' responses to open-ended questions on regenerative therapy guidelines for chronic pain, assessing their accuracy, reliability, usefulness, readability, semantic similarity, and hallucination rates. This cross-sectional study used 16 open-ended questions based on the American Society of Pain and Neuroscience's regenerative therapy guidelines for chronic pain. Questions were answered by ChatGPT-4o, Gemini 2.5 Flash, and Claude Opus 4. Responses were rated on a 7-point Likert scale for usefulness and reliability, and on a 5-point scale for accuracy. Reference hallucination, readability (FKRE, FKGL), and semantic similarity (USE, ROUGE-L) were also assessed. Statistical comparisons were made, with significance set at p < 0.05. Claude Opus 4 showed the highest reliability (5.19 ± 1.11), usefulness (5.06 ± 1.0), and clinical accuracy (4.06 ± 0.68), outperforming ChatGPT-4o (4.13 ± 0.96; 3.94 ± 0.85; 3.38 ± 0.72) and Gemini 2.5 Flash (4.19 ± 0.98; 4.06 ± 0.93; 3.38 ± 0.62). Claude also had the lowest reference hallucination score (RHS 4.44 ± 3.18) versus ChatGPT-4o (8.38 ± 1.86) and Gemini (8.75 ± 1.73). In semantic similarity, Claude (0.68 ± 0.08) and Gemini (0.65 ± 0.07) surpassed ChatGPT-4o (0.60 ± 0.09). Gemini led in ROUGE-L F1 (0.12 ± 0.03) versus Claude (0.10 ± 0.02) and ChatGPT-4o (0.07 ± 0.03). Readability was similar across models, though Gemini had a higher FKGL (11.3 ± 1.06) than Claude (10.3 ± 2.09). Overall, Claude Opus 4 showed superior accuracy, reliability, and usefulness, with significantly fewer hallucinations; readability was comparable across models. Further research is recommended.
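The ROUGE-L F1 scores reported above measure lexical overlap between a model response and the guideline reference text via their longest common subsequence (LCS). As a minimal sketch of how such a score is typically computed (whitespace tokenization is a simplifying assumption here; the authors' exact implementation is not specified in the abstract):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    # Token-level ROUGE-L: precision and recall derived from the LCS
    # length, combined into an F1 score (beta = 1).
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Because the metric rewards shared word sequences rather than shared meaning, a paraphrased but accurate answer can score low, which is consistent with the modest ROUGE-L values (0.07–0.12) reported for all three models.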
About the journal
RHEUMATOLOGY INTERNATIONAL is an independent journal reflecting world-wide progress in the research, diagnosis and treatment of the various rheumatic diseases. It is designed to serve researchers and clinicians in the field of rheumatology.
RHEUMATOLOGY INTERNATIONAL will cover all modern trends in clinical research as well as in the management of rheumatic diseases. Special emphasis will be given to public health issues related to rheumatic diseases, applying rheumatology research to clinical practice, epidemiology of rheumatic diseases, diagnostic tests for rheumatic diseases, patient-reported outcomes (PROs) in rheumatology, and evidence on rheumatology education. Contributions to these topics will appear in the form of original publications, short communications, editorials, and reviews. "Letters to the editor" will be welcome as an enhancement to discussion. Submission of basic science research, including in vitro or animal studies, is discouraged, as we will only review human studies with an epidemiological or clinical perspective. Case reports without a proper review of the literature (case-based reviews) will not be published. Every effort will be made to ensure speed of publication while maintaining a high standard of contents and production.
Manuscripts submitted for publication must contain a statement to the effect that all human studies have been reviewed by the appropriate ethics committee and have therefore been performed in accordance with the ethical standards laid down in an appropriate version of the 1964 Declaration of Helsinki. It should also be stated clearly in the text that all persons gave their informed consent prior to their inclusion in the study. Details that might disclose the identity of the subjects under study should be omitted.