A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology.

Impact factor: 2.0; JCR quartile: Q3 (Dermatology)

Indian Dermatology Online Journal. Pub Date: 2025-02-27. eCollection Date: 2025-03-01. DOI: 10.4103/idoj.idoj_221_24

Aravind Baskar Murthy, Vijayasankar Palaniappan, Suganya Radhakrishnan, Sathish Rajaa, Kaliaperumal Karthikeyan

Indian Dermatology Online Journal, volume 16, issue 2, pages 241-247. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11927985/pdf/


Abstract

Background: With growing interest in generative artificial intelligence (AI), the scientific community is witnessing the wide utility of large language models (LLMs) with chat interfaces, such as ChatGPT and Microsoft Bing Chat, in medicine and research. This study aimed to investigate the accuracy of ChatGPT and Microsoft Bing Chat in answering questions on Dermatology, Venereology, and Leprosy, to measure the frequency of artificial hallucinations, and to compare the models' performance with that of human respondents.

Aim and objectives: The primary objective of the study was to compare the knowledge and interpretation abilities of LLMs (ChatGPT v3.5 and Microsoft Bing Chat) with those of human respondents (12 final-year postgraduates). The secondary objective was to assess the incidence of artificial hallucinations. The assessment used 60 questions prepared by the authors, including multiple-choice questions (MCQs), fill-in-the-blanks, and scenario-based questions.

Materials and methods: The authors accessed two commercially available LLMs with chat interfaces, namely ChatGPT version 3.5 (OpenAI; San Francisco, CA) and Microsoft Bing Chat, from August 10 to August 23, 2023.

Results: In our testing set of 60 questions, Bing Chat outperformed ChatGPT and human respondents with a mean correct response score of 46.9 ± 0.7. The mean correct responses by ChatGPT and human respondents were 35.9 ± 0.5 and 25.8 ± 11.0, respectively. The overall accuracy of human respondents, ChatGPT, and Bing Chat was 43%, 59.8%, and 78.2%, respectively. Across MCQs, fill-in-the-blanks, and scenario-based questions, Bing Chat had the highest accuracy in every question type, with statistical significance (P < 0.001 by ANOVA test). Topic-wise assessment of the LLMs showed that Bing Chat performed better in all topics except vascular disorders, inflammatory disorders, and leprosy. Bing Chat performed better on easy and medium-difficulty questions, with accuracies of 85.7% and 78%, respectively. In comparison, ChatGPT performed well on hard questions, with an accuracy of 55% and statistical significance (P < 0.001 by ANOVA test). Among the 10 questions with multiple correct responses, the mean number answered by the human respondents was 3 ± 1.4. The accuracy of LLMs on questions with multiple correct responses was assessed by employing two prompts. ChatGPT and Bing Chat could answer 3.1 ± 0.3 and 4 ± 0 questions, respectively, without prompting. On evaluating the logical reasoning ability of the LLMs, ChatGPT gave logical reasoning in 47 ± 0.4 questions and Bing Chat in 53.9 ± 0.5 questions, irrespective of the correctness of the responses. ChatGPT exhibited artificial hallucination in 4 questions, even with 12 repeated inputs; this was not observed in Bing Chat.
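The overall accuracy figures follow directly from the mean correct-response scores divided by the 60 questions in the test set. The short Python sketch below reproduces that arithmetic using only the means reported in the abstract; the function and variable names are illustrative assumptions, not part of the study's methods.

```python
# Mean number of correct responses out of 60 questions, as reported in the abstract.
TOTAL_QUESTIONS = 60

mean_correct = {
    "Human respondents": 25.8,
    "ChatGPT": 35.9,
    "Bing Chat": 46.9,
}


def accuracy_percent(mean_score: float, total: int = TOTAL_QUESTIONS) -> float:
    """Convert a mean correct-response count into a percentage, rounded to one decimal."""
    return round(100 * mean_score / total, 1)


# Hypothetical helper output: one accuracy value per respondent group.
accuracies = {name: accuracy_percent(score) for name, score in mean_correct.items()}

for name, acc in accuracies.items():
    print(f"{name}: {acc}%")
```

The computed values (43.0%, 59.8%, and 78.2%) agree with the accuracies stated in the abstract, which suggests the reported percentages are simply the mean scores expressed as a fraction of 60.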

Limitations: The limitations of this study include variability in respondent accuracy, a small question set, and the exclusion of newer AI models and image-based assessments.

Conclusion: This study showed that LLMs performed better overall than human respondents. However, the LLMs were less accurate than human respondents in topics such as inflammatory disorders and leprosy. Proper regulations on the use of LLMs are urgently needed to avoid potential misuse.


