Scenario-based evaluation of large language models for reference accuracy in dermatology: literature retrieval on latent tuberculosis in psoriasis patients on anti-IL-17/23 therapy.
Nihal Altunisik, Sibel Altunisik Toplu, Dursun Turkmen
Journal: Cutaneous and Ocular Toxicology (Q3, Ophthalmology), pp. 1-6
DOI: 10.1080/15569527.2026.2656177
Published: 2026-04-16 (Journal Article)
Citations: 0
Abstract
Background: Large language models (LLMs) could accelerate clinical literature searches, but their reliability is compromised by "hallucinations", i.e., fabricated references. This study compared three general-purpose LLMs on reference accuracy, relevance, and hallucination rates using a standardized dermatology literature-retrieval prompt.
Methods: A clinical scenario on latent tuberculosis management in psoriasis patients on IL-17/23 inhibitors was defined. To establish a reference standard, references (n=74) from the two most recent and comprehensive systematic reviews on the topic were screened. These two reviews were selected because they represented the most current and complete syntheses of evidence on this clinical question; using their reference lists ensured a focused, expert-validated foundation for evaluating LLM outputs. This screening yielded 16 studies directly addressing the scenario. Each LLM (ChatGPT, Gemini, Deepseek-V3.2) was prompted to list 15 recent, specific references. The resulting 45 references (15 per model) were manually validated as "True and Relevant," "True but Irrelevant/General," or "False/Hallucination." Category distributions were compared using Pearson's chi-square test.
Results: A significant difference was found between models (p<0.010). ChatGPT listed 80.0% (12/15) correct and relevant references with no hallucinations. Gemini produced 80.0% (12/15) hallucinations, while Deepseek-V3.2 generated 100.0% fictional references. Notably, 4 references that ChatGPT returned were valid articles overlooked by the predefined pool; these were verified as relevant, indicating that the reference standard may not have been exhaustive.
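The three-way comparison above rests on a Pearson chi-square test over a 3×3 contingency table of outcome counts (model × validation category). A minimal stdlib-only sketch follows. The ChatGPT and Deepseek-V3.2 rows follow from the reported figures; the abstract does not say how Gemini's 3 non-hallucinated references split between "relevant" and "irrelevant," so that split below is a placeholder assumption for illustration only.

```python
from itertools import product

# Outcome counts per model: [True and Relevant, True but Irrelevant, Hallucination].
# ChatGPT and Deepseek-V3.2 rows follow from the reported results (15 each);
# the Gemini 2/1 split of its 3 non-hallucinated references is ASSUMED.
observed = {
    "ChatGPT":       [12, 3, 0],   # 3 = 15 - 12 relevant - 0 hallucinations
    "Gemini":        [2, 1, 12],   # 2/1 split is a placeholder assumption
    "Deepseek-V3.2": [0, 0, 15],
}

rows = list(observed.values())
n = sum(sum(r) for r in rows)                               # 45 references total
row_totals = [sum(r) for r in rows]                         # 15 per model
col_totals = [sum(r[j] for r in rows) for j in range(3)]    # per-category totals

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E,
# where E = row_total * col_total / n is the expected count.
chi2 = 0.0
for i, j in product(range(3), range(3)):
    expected = row_totals[i] * col_totals[j] / n
    chi2 += (rows[i][j] - expected) ** 2 / expected

df = (3 - 1) * (3 - 1)   # degrees of freedom = 4
critical_005 = 9.488     # chi-square critical value at alpha = 0.05, df = 4
print(f"chi2 = {chi2:.2f}, df = {df}, significant at 0.05: {chi2 > critical_005}")
```

Even under these placeholder assumptions the statistic (about 35) far exceeds the df=4 critical value, consistent with the significant between-model difference the study reports.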
Conclusion: LLM performance varies considerably, with a high risk of hallucination. The findings underscore the need for caution and independent verification of LLM-generated references. Future research should test advanced query techniques and hybrid systems integrating LLMs with academic databases.
Journal description:
Cutaneous and Ocular Toxicology is an international, peer-reviewed journal that covers all types of harm to cutaneous and ocular systems. Areas of particular interest include pharmaceutical and medical products; consumer, personal care, and household products; and issues in environmental and occupational exposures.
In addition to original research papers, reviews and short communications are invited, as well as concise, relevant, and critical reviews of topics of contemporary significance.