Referential hallucination and clinical reliability in large language models: a comparative analysis using regenerative medicine guidelines for chronic pain.
{"title":"Referential hallucination and clinical reliability in large language models: a comparative analysis using regenerative medicine guidelines for chronic pain.","authors":"Ozlem Kuculmez, Ahmet Usen, Emine Dündar Ahi","doi":"10.1007/s00296-025-05996-z","DOIUrl":null,"url":null,"abstract":"<p><p>This study compared language models' responses to open-ended questions on regenerative therapy guidelines for chronic pain, assessing their accuracy, reliability, usefulness, readability, semantic similarity, and hallucination rates. This cross-sectional study used 16 open-ended questions based on the American Society of Pain and Neuroscience's regenerative therapy guidelines for chronic pain. Questions were answered by ChatGPT-4o, Gemini 2.5 Flash, and Claude 4 Opus. Responses were rated on a 7-point Likert scale for usability and reliability, and a 5-point scale for accuracy. Hallucinogenicity, readability (FKRE, FKGL), and similarity (USE, ROUGE-L) were also assessed. Statistical comparisons were made, with significance set at p < 0.05. Claude Opus 4 showed the highest reliability (5.19 ± 1.11), usefulness (5.06 ± 1.0), and clinical accuracy (4.06 ± 0.68), outperforming ChatGPT-4o (4.13 ± 0.96; 3.94 ± 0.85; 3.38 ± 0.72) and Gemini 2.5 (4.19 ± 0.98; 4.06 ± 0.93; 3.38 ± 0.62). Claude had the lowest reference hallucinations (RHS 4.44 ± 3.18) vs. ChatGPT-4o (8.38 ± 1.86) and Gemini 2.5 (8.75 ± 1.73). In semantic similarity, Claude (0.68 ± 0.08) and Gemini (0.65 ± 0.07) surpassed ChatGPT-4o (0.60 ± 0.09). Gemini led in ROUGE-L F1 (0.12 ± 0.03) vs. Claude (0.10 ± 0.02) and ChatGPT-4o (0.07 ± 0.03). Readability was similar, though Gemini had a higher FKGL (11.3 ± 1.06) than Claude (10.3 ± 2.09). Claude Opus 4 showed superior accuracy, reliability, and usefulness, with significantly fewer hallucinations. Readability scores were similar across models. 
Further research is recommended.</p>","PeriodicalId":21322,"journal":{"name":"Rheumatology International","volume":"45 10","pages":"240"},"PeriodicalIF":2.9000,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Rheumatology International","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00296-025-05996-z","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"RHEUMATOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
This study compared large language models' responses to open-ended questions on regenerative therapy guidelines for chronic pain, assessing their accuracy, reliability, usefulness, readability, semantic similarity, and hallucination rates. This cross-sectional study used 16 open-ended questions based on the American Society of Pain and Neuroscience's regenerative therapy guidelines for chronic pain. Questions were answered by ChatGPT-4o, Gemini 2.5 Flash, and Claude Opus 4. Responses were rated on a 7-point Likert scale for usefulness and reliability, and on a 5-point scale for accuracy. Reference hallucination, readability (FKRE, FKGL), and semantic similarity (USE, ROUGE-L) were also assessed. Statistical comparisons were made, with significance set at p < 0.05. Claude Opus 4 showed the highest reliability (5.19 ± 1.11), usefulness (5.06 ± 1.0), and clinical accuracy (4.06 ± 0.68), outperforming ChatGPT-4o (4.13 ± 0.96; 3.94 ± 0.85; 3.38 ± 0.72) and Gemini 2.5 Flash (4.19 ± 0.98; 4.06 ± 0.93; 3.38 ± 0.62). Claude also had the lowest reference hallucination score (RHS 4.44 ± 3.18) versus ChatGPT-4o (8.38 ± 1.86) and Gemini (8.75 ± 1.73). In semantic similarity, Claude (0.68 ± 0.08) and Gemini (0.65 ± 0.07) surpassed ChatGPT-4o (0.60 ± 0.09). Gemini led in ROUGE-L F1 (0.12 ± 0.03) versus Claude (0.10 ± 0.02) and ChatGPT-4o (0.07 ± 0.03). Readability was similar across models, though Gemini had a higher FKGL (11.3 ± 1.06) than Claude (10.3 ± 2.09). Overall, Claude Opus 4 showed superior accuracy, reliability, and usefulness, with significantly fewer hallucinations; readability was comparable across models. Further research is recommended.
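The ROUGE-L F1 scores reported above measure lexical overlap between a model response and the guideline reference text via their longest common subsequence (LCS). As a minimal sketch of how such a score is typically computed (whitespace tokenization is a simplifying assumption here; the authors' exact implementation is not specified in the abstract):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    # Token-level ROUGE-L: precision and recall derived from the LCS
    # length, combined into an F1 score (beta = 1).
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Because the metric rewards shared word sequences rather than shared meaning, a paraphrased but accurate answer can score low, which is consistent with the modest ROUGE-L values (0.07–0.12) reported for all three models.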
About the journal
RHEUMATOLOGY INTERNATIONAL is an independent journal reflecting world-wide progress in the research, diagnosis and treatment of the various rheumatic diseases. It is designed to serve researchers and clinicians in the field of rheumatology.
RHEUMATOLOGY INTERNATIONAL will cover all modern trends in clinical research as well as in the management of rheumatic diseases. Special emphasis will be given to public health issues related to rheumatic diseases, applying rheumatology research to clinical practice, epidemiology of rheumatic diseases, diagnostic tests for rheumatic diseases, patient-reported outcomes (PROs) in rheumatology, and evidence on rheumatology education. Contributions to these topics will appear in the form of original publications, short communications, editorials, and reviews. "Letters to the editor" will be welcome as an enhancement to discussion. Submission of basic science research, including in vitro or animal studies, is discouraged, as we will only review human studies with an epidemiological or clinical perspective. Case reports without a proper review of the literature (case-based reviews) will not be published. Every effort will be made to ensure speed of publication while maintaining a high standard of contents and production.
Manuscripts submitted for publication must contain a statement to the effect that all human studies have been reviewed by the appropriate ethics committee and have therefore been performed in accordance with the ethical standards laid down in an appropriate version of the 1964 Declaration of Helsinki. It should also be stated clearly in the text that all persons gave their informed consent prior to their inclusion in the study. Details that might disclose the identity of the subjects under study should be omitted.