Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases With Atypical Presentation: Descriptive Research.

IF 3.2 Q1 EDUCATION, SCIENTIFIC DISCIPLINES

JMIR Medical Education Pub Date : 2024-06-21 DOI:10.2196/58758

Kiyoshi Shikino, Taro Shimizu, Yuki Otsuka, Masaki Tago, Hiromizu Takahashi, Takashi Watari, Yosuke Sasaki, Gemmei Iizuka, Hiroki Tamura, Koichi Nakashima, Kotaro Kunitomo, Morika Suzuki, Sayaka Aoyama, Shintaro Kosaka, Teiko Kawahigashi, Tomohiro Matsumoto, Fumina Orihara, Toru Morikawa, Toshinori Nishizawa, Yoji Hoshina, Yu Yamamoto, Yuichiro Matsuo, Yuto Unoki, Hirofumi Kimura, Midori Tokushima, Satoshi Watanuki, Takuma Saito, Fumio Otsuka, Yasuharu Tokuda

{"title":"Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases With Atypical Presentation: Descriptive Research.","authors":"Kiyoshi Shikino, Taro Shimizu, Yuki Otsuka, Masaki Tago, Hiromizu Takahashi, Takashi Watari, Yosuke Sasaki, Gemmei Iizuka, Hiroki Tamura, Koichi Nakashima, Kotaro Kunitomo, Morika Suzuki, Sayaka Aoyama, Shintaro Kosaka, Teiko Kawahigashi, Tomohiro Matsumoto, Fumina Orihara, Toru Morikawa, Toshinori Nishizawa, Yoji Hoshina, Yu Yamamoto, Yuichiro Matsuo, Yuto Unoki, Hirofumi Kimura, Midori Tokushima, Satoshi Watanuki, Takuma Saito, Fumio Otsuka, Yasuharu Tokuda","doi":"10.2196/58758","DOIUrl":null,"url":null,"abstract":"Background: The persistence of diagnostic errors, despite advances in medical knowledge and diagnostics, highlights the importance of understanding atypical disease presentations and their contribution to mortality and morbidity. Artificial intelligence (AI), particularly generative pre-trained transformers like GPT-4, holds promise for improving diagnostic accuracy, but requires further exploration in handling atypical presentations.Objective: This study aimed to assess the diagnostic accuracy of ChatGPT in generating differential diagnoses for atypical presentations of common diseases, with a focus on the model's reliance on patient history during the diagnostic process.Methods: We used 25 clinical vignettes from the Journal of Generalist Medicine characterizing atypical manifestations of common diseases. Two general medicine physicians categorized the cases based on atypicality. ChatGPT was then used to generate differential diagnoses based on the clinical information provided. The concordance between AI-generated and final diagnoses was measured, with a focus on the top-ranked disease (top 1) and the top 5 differential diagnoses (top 5).Results: ChatGPT's diagnostic accuracy decreased with an increase in atypical presentation. For category 1 (C1) cases, the concordance rates were 17% (n=1) for the top 1 and 67% (n=4) for the top 5. Categories 3 (C3) and 4 (C4) showed a 0% concordance for top 1 and markedly lower rates for the top 5, indicating difficulties in handling highly atypical cases. The χ2 test revealed no significant difference in the top 1 differential diagnosis accuracy between less atypical (C1+C2) and more atypical (C3+C4) groups (χ²1=2.07; n=25; P=.13). However, a significant difference was found in the top 5 analyses, with less atypical cases showing higher accuracy (χ²1=4.01; n=25; P=.048).Conclusions: ChatGPT-4 demonstrates potential as an auxiliary tool for diagnosing typical and mildly atypical presentations of common diseases. However, its performance declines with greater atypicality. The study findings underscore the need for AI systems to encompass a broader range of linguistic capabilities, cultural understanding, and diverse clinical scenarios to improve diagnostic utility in real-world settings.","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":"10 ","pages":"e58758"},"PeriodicalIF":3.2000,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11199925/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/58758","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}

引用次数: 0

Abstract

Background: The persistence of diagnostic errors, despite advances in medical knowledge and diagnostics, highlights the importance of understanding atypical disease presentations and their contribution to mortality and morbidity. Artificial intelligence (AI), particularly generative pre-trained transformers like GPT-4, holds promise for improving diagnostic accuracy, but requires further exploration in handling atypical presentations.

Objective: This study aimed to assess the diagnostic accuracy of ChatGPT in generating differential diagnoses for atypical presentations of common diseases, with a focus on the model's reliance on patient history during the diagnostic process.

Methods: We used 25 clinical vignettes from the Journal of Generalist Medicine characterizing atypical manifestations of common diseases. Two general medicine physicians categorized the cases based on atypicality. ChatGPT was then used to generate differential diagnoses based on the clinical information provided. The concordance between AI-generated and final diagnoses was measured, with a focus on the top-ranked disease (top 1) and the top 5 differential diagnoses (top 5).

Results: ChatGPT's diagnostic accuracy decreased with an increase in atypical presentation. For category 1 (C1) cases, the concordance rates were 17% (n=1) for the top 1 and 67% (n=4) for the top 5. Categories 3 (C3) and 4 (C4) showed a 0% concordance for top 1 and markedly lower rates for the top 5, indicating difficulties in handling highly atypical cases. The χ2 test revealed no significant difference in the top 1 differential diagnosis accuracy between less atypical (C1+C2) and more atypical (C3+C4) groups (χ²1=2.07; n=25; P=.13). However, a significant difference was found in the top 5 analyses, with less atypical cases showing higher accuracy (χ²1=4.01; n=25; P=.048).

Conclusions: ChatGPT-4 demonstrates potential as an auxiliary tool for diagnosing typical and mildly atypical presentations of common diseases. However, its performance declines with greater atypicality. The study findings underscore the need for AI systems to encompass a broader range of linguistic capabilities, cultural understanding, and diverse clinical scenarios to improve diagnostic utility in real-world settings.

查看原文本刊更多论文

评估由 ChatGPT 生成的对表现不典型的常见疾病的鉴别诊断：描述性研究。

背景：尽管医学知识和诊断技术不断进步，但诊断错误依然存在，这凸显了了解非典型疾病表现及其对死亡率和发病率影响的重要性。人工智能（AI），尤其是像 GPT-4 这样的生成式预训练转换器，有望提高诊断准确性，但在处理非典型表现方面还需要进一步探索：本研究旨在评估 ChatGPT 在生成常见疾病非典型表现的鉴别诊断时的诊断准确性，重点关注该模型在诊断过程中对患者病史的依赖：我们使用了《全科医学杂志》中的 25 个临床案例，这些案例描述了常见疾病的非典型表现。两名全科医生根据非典型性对病例进行了分类。然后使用 ChatGPT 根据所提供的临床信息生成鉴别诊断。测量了人工智能生成的诊断与最终诊断之间的一致性，重点是排名前列的疾病（前 1 位）和前 5 位的鉴别诊断（前 5 位）：结果：ChatGPT 的诊断准确率随着非典型表现的增加而降低。对于第 1 类（C1）病例，排名前 1 位的吻合率为 17%（n=1），排名前 5 位的吻合率为 67%（n=4）。第 3 类（C3）和第 4 类（C4）前 1 名的吻合率为 0%，前 5 名的吻合率明显较低，这表明在处理高度不典型病例时存在困难。χ2检验显示，非典型较少（C1+C2）组和非典型较多（C3+C4）组在前1项鉴别诊断准确率上无显著差异（χ²1=2.07；n=25；P=.13）。然而，在前 5 项分析中发现了明显差异，非典型较少的病例显示出更高的准确性（χ²1=4.01；n=25；P=.048）：结论：ChatGPT-4可作为诊断常见疾病的典型和轻度不典型表现的辅助工具。然而，随着非典型性的增加，其性能也会下降。研究结果表明，人工智能系统需要具备更广泛的语言能力、文化理解能力和多样化的临床场景，以提高在真实世界环境中的诊断效用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊