Large language models for dermatological image interpretation - a comparative study

Lasse Cirkel, Fabian Lechner, Lukas Alexander Henk, Martin Krusche, Martin C Hirsch, Michael Hertl, Sebastian Kuhn, Johannes Knitza

Diagnosis, published online 2025-05-23. DOI: 10.1515/dx-2025-0014
Citations: 0
Abstract
Objectives: Interpreting skin findings can be challenging for both laypersons and clinicians. Large language models (LLMs) offer accessible decision support, yet their diagnostic capabilities for dermatological images remain underexplored. This study evaluated the diagnostic performance of LLMs in interpreting images of common dermatological diseases.
Methods: A total of 500 dermatological images, encompassing four prevalent skin conditions (psoriasis, vitiligo, erysipelas and rosacea), were used to compare seven multimodal LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, Claude 3.5 Sonnet, Llama3.2 90B and 11B). A standardized prompt was used to generate one top diagnosis.
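As a rough illustration of this setup, the sketch below shows how a single image and a standardized prompt could be sent to one multimodal model (here GPT-4o via the OpenAI Python SDK) to obtain a single top diagnosis. The prompt wording, file name, and helper function are assumptions for illustration and are not taken from the study.

```python
# Minimal sketch: query a multimodal LLM with one dermatological image and a
# standardized prompt asking for a single top diagnosis.
# Assumptions: OpenAI Python SDK, an API key in the environment, and a local
# JPEG file; the exact study prompt is not reproduced here.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STANDARDIZED_PROMPT = (
    "You are shown a clinical photograph of a skin finding. "
    "Name the single most likely diagnosis, answering with the disease name only."
)

def top_diagnosis(image_path: str, model: str = "gpt-4o") -> str:
    """Return the model's single top diagnosis for one image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": STANDARDIZED_PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content.strip()

# Example call (hypothetical file name):
# print(top_diagnosis("psoriasis_case_001.jpg"))
```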
Results: The highest overall accuracy was achieved by GPT-4o (67.8 %), followed by GPT-4o mini (63.8 %) and Llama3.2 11B (61.4 %). Accuracy varied considerably across conditions, with psoriasis showing the highest mean LLM accuracy (59.2 %) and erysipelas the lowest (33.4 %). 11.0 % of all images were misdiagnosed by all LLMs, whereas 11.6 % were correctly diagnosed by all models. Correct diagnoses by all LLMs were linked to clear, disease-specific features, such as sharply demarcated erythematous plaques in psoriasis. Llama3.2 90B was the only LLM to decline to diagnose images, particularly those involving intimate areas of the body.
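To make the reported metrics concrete, the sketch below aggregates overall accuracy per model, mean LLM accuracy per condition, and the share of images missed or solved by every model from a hypothetical per-image results table; the field names and example entries are placeholders, not the study's actual data.

```python
# Sketch: aggregate accuracy metrics from per-image, per-model predictions.
# `results` maps image id -> (true label, {model name: predicted label});
# all entries below are hypothetical placeholders.
from collections import defaultdict

results = {
    "img_001": ("psoriasis", {"GPT-4o": "psoriasis", "Llama3.2 11B": "eczema"}),
    "img_002": ("erysipelas", {"GPT-4o": "cellulitis", "Llama3.2 11B": "erysipelas"}),
    # ... one entry per image in the evaluation set
}

correct_by_model = defaultdict(int)
per_condition = defaultdict(lambda: [0, 0])  # condition -> [correct predictions, total predictions]
missed_by_all = solved_by_all = 0

for _, (truth, predictions) in results.items():
    hits = [pred.lower() == truth.lower() for pred in predictions.values()]
    missed_by_all += not any(hits)
    solved_by_all += all(hits)
    for model, pred in predictions.items():
        is_correct = pred.lower() == truth.lower()
        correct_by_model[model] += is_correct
        per_condition[truth][0] += is_correct
        per_condition[truth][1] += 1

n_images = len(results)
for model, n_correct in correct_by_model.items():
    print(f"{model}: overall accuracy {n_correct / n_images:.1%}")
for condition, (n_correct, n_total) in per_condition.items():
    print(f"{condition}: mean LLM accuracy {n_correct / n_total:.1%}")
print(f"missed by all models: {missed_by_all / n_images:.1%}")
print(f"solved by all models: {solved_by_all / n_images:.1%}")
```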
Conclusions: LLM performance varied significantly, emphasizing the need for cautious usage. Notably, a free, locally hostable model correctly identified the top diagnosis for approximately two-thirds of all images, demonstrating the potential for safer, locally deployed LLMs. Advancements in model accuracy and the integration of clinical metadata could further enhance accessible and reliable clinical decision support systems.
Journal overview:
Diagnosis focuses on how diagnosis can be advanced, how it is taught, and how and why it can fail, leading to diagnostic errors. The journal welcomes both fundamental and applied works, improvement initiatives, opinions, and debates to encourage new thinking on improving this critical aspect of healthcare quality.

Topics:
- Factors that promote diagnostic quality and safety
- Clinical reasoning
- Diagnostic errors in medicine
- The factors that contribute to diagnostic error: human factors, cognitive issues, and system-related breakdowns
- Improving the value of diagnosis – eliminating waste and unnecessary testing
- How culture and removing blame promote awareness of diagnostic errors
- Training and education related to clinical reasoning and diagnostic skills
- Advances in laboratory testing and imaging that improve diagnostic capability
- Local, national and international initiatives to reduce diagnostic error