Scenario-based evaluation of large language models for reference accuracy in dermatology: literature retrieval on latent tuberculosis in psoriasis patients on anti-IL-17/23 therapy.
Nihal Altunisik, Sibel Altunisik Toplu, Dursun Turkmen
Journal: Cutaneous and Ocular Toxicology (Q3, Ophthalmology), pp. 1-6
DOI: 10.1080/15569527.2026.2656177
Published: 2026-04-16 (Journal Article)
Citations: 0
Abstract
Background: Large language models (LLMs) could accelerate clinical literature searches, but their reliability is compromised by "hallucinations", i.e., fabricated references. This study compared three general-purpose LLMs on reference accuracy, relevance, and hallucination rates using a standardized dermatology literature-retrieval prompt.
Methods: A clinical scenario on latent tuberculosis management in psoriasis patients on IL-17/23 inhibitors was defined. To establish a reference standard, references (n=74) from the two most recent and comprehensive systematic reviews on the topic were screened. These two reviews were selected because they represented the most current and complete syntheses of evidence on this clinical question; using their reference lists ensured a focused, expert-validated foundation for evaluating LLM outputs. This screening yielded 16 studies directly addressing the scenario. Each LLM (ChatGPT, Gemini, Deepseek-V3.2) was prompted to list 15 recent, specific references. The resulting 45 references (15 per model) were manually validated as "True and Relevant," "True but Irrelevant/General," or "False/Hallucination." Category distributions were compared using Pearson's chi-square test.
Results: A significant difference was found between models (p<0.010). ChatGPT listed 80.0% (12/15) correct and relevant references with no hallucinations. Gemini produced 80.0% (12/15) hallucinations, while Deepseek-V3.2 generated 100.0% fictional references. Notably, 4 references that ChatGPT returned were valid articles overlooked by the predefined pool; these were verified as relevant, indicating that the reference standard may not have been exhaustive.
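The three-way comparison above rests on a Pearson chi-square test over a 3×3 contingency table of outcome counts (model × validation category). A minimal stdlib-only sketch follows. The ChatGPT and Deepseek-V3.2 rows follow from the reported figures; the abstract does not say how Gemini's 3 non-hallucinated references split between "relevant" and "irrelevant," so that split below is a placeholder assumption for illustration only.

```python
from itertools import product

# Outcome counts per model: [True and Relevant, True but Irrelevant, Hallucination].
# ChatGPT and Deepseek-V3.2 rows follow from the reported results (15 each);
# the Gemini 2/1 split of its 3 non-hallucinated references is ASSUMED.
observed = {
    "ChatGPT":       [12, 3, 0],   # 3 = 15 - 12 relevant - 0 hallucinations
    "Gemini":        [2, 1, 12],   # 2/1 split is a placeholder assumption
    "Deepseek-V3.2": [0, 0, 15],
}

rows = list(observed.values())
n = sum(sum(r) for r in rows)                               # 45 references total
row_totals = [sum(r) for r in rows]                         # 15 per model
col_totals = [sum(r[j] for r in rows) for j in range(3)]    # per-category totals

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E,
# where E = row_total * col_total / n is the expected count.
chi2 = 0.0
for i, j in product(range(3), range(3)):
    expected = row_totals[i] * col_totals[j] / n
    chi2 += (rows[i][j] - expected) ** 2 / expected

df = (3 - 1) * (3 - 1)   # degrees of freedom = 4
critical_005 = 9.488     # chi-square critical value at alpha = 0.05, df = 4
print(f"chi2 = {chi2:.2f}, df = {df}, significant at 0.05: {chi2 > critical_005}")
```

Even under these placeholder assumptions the statistic (about 35) far exceeds the df=4 critical value, consistent with the significant between-model difference the study reports.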
Conclusion: LLM performance varies considerably, with a high risk of hallucination. The findings underscore the need for caution and independent verification of LLM-generated references. Future research should test advanced query techniques and hybrid systems integrating LLMs with academic databases.
Journal description:
Cutaneous and Ocular Toxicology is an international, peer-reviewed journal that covers all types of harm to cutaneous and ocular systems. Areas of particular interest include pharmaceutical and medical products; consumer, personal care, and household products; and issues in environmental and occupational exposures.
In addition to original research papers, reviews and short communications are invited, as well as concise, relevant, and critical reviews of topics of contemporary significance.