Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience

Marina Alvarez-Estape, Ivan Cano, Rosa Pino, Carla González Grado, Andrea Aldemira-Liz, Javier Gonzálvez-Ortuño, Juanjo do Olmo, Javier Logroño, Marcelo Martínez, Carlos Mascías, Julián Isla, Jordi Martínez Roldán, Cristian Launes, Francesc Garcia-Cuyas, Paula Esteller-Cucala
{"title":"Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience","authors":"Marina Alvarez-Estape, Ivan Cano, Rosa Pino, Carla González Grado, Andrea Aldemira-Liz, Javier Gonzálvez-Ortuño, Juanjo do Olmo, Javier Logroño, Marcelo Martínez, Carlos Mascías, Julián Isla, Jordi Martínez Roldán, Cristian Launes, Francesc Garcia-Cuyas, Paula Esteller-Cucala","doi":"10.1101/2024.07.23.24310847","DOIUrl":null,"url":null,"abstract":"<strong>Importance</strong> The time to accurately diagnose rare pediatric diseases often spans years. Assessing the diagnostic accuracy of an LLM-based tool on real pediatric cases can help reduce this time, providing quicker diagnoses for patients and their families. <strong>Objective</strong> To evaluate the clinical utility of DxGPT as a support tool for differential diagnosis of both common and rare diseases. <strong>Design</strong> Unicentric descriptive cross-sectional exploratory study. Anonymized data from 50 pediatric patients' medical histories, covering common and rare pathologies, were used to generate clinical case notes. Each clinical case included essential data, with some expanded by complementary tests. <strong>Setting</strong> This study was conducted at a reference pediatric hospital, Sant Joan de Déu Barcelona Children′s Hospital. <strong>Participants</strong> A total of 50 clinical cases were diagnosed by 78 volunteer doctors (medical diagnostic team) with varying experience, each reviewing 3 clinical cases. <strong>Interventions</strong> Each clinician listed up to five diagnoses per clinical case note. The same was done on the DxGPT web platform, obtaining the Top-5 diagnostic proposals. To evaluate DxGPT's variability, each note was queried three times. <strong>Main Outcome(s) and Measure(s)</strong> The study mainly focused on comparing diagnostic accuracy, defined as the percentage of cases with the correct diagnosis, between the medical diagnostic team and DxGPT. Other evaluation criteria included qualitative assessments. The medical diagnostic team also completed a survey on their user experience with DxGPT.\n<strong>Results</strong> Top-5 diagnostic accuracy was 65% for clinicians and 60% for DxGPT, with no significant differences. Accuracies for common diseases were higher (Clinicians: 79%, DxGPT: 71%) than for rare diseases (Clinicians: 50%, DxGPT: 49%). Accuracy increased similarly in both groups with expanded information, but this increase was only stastically significant in clinicians (simple 52% vs. expanded 69%; p=0.03). DxGPT′s response variability affected less than 5% of clinical case notes. A survey of 48 clinicians rated the DxGPT platform 3.9/5 overall, 4.1/5 for usefulness, and 4.5/5 for usability. <strong>Conclusions and Relevance</strong> DxGPT showed diagnostic accuracies similar to medical staff from a pediatric hospital, indicating its potential for supporting differential diagnosis in other settings. Clinicians praised its usability and simplicity. 
These tools could provide new insights for challenging diagnostic cases.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.07.23.24310847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Importance: The time to accurately diagnose rare pediatric diseases often spans years. Assessing the diagnostic accuracy of an LLM-based tool on real pediatric cases can help reduce this time, providing quicker diagnoses for patients and their families.

Objective: To evaluate the clinical utility of DxGPT as a support tool for the differential diagnosis of both common and rare diseases.

Design: Unicentric, descriptive, cross-sectional, exploratory study. Anonymized data from the medical histories of 50 pediatric patients, covering common and rare pathologies, were used to generate clinical case notes. Each clinical case note included essential data, and some were expanded with complementary test results.

Setting: This study was conducted at a reference pediatric hospital, Sant Joan de Déu Barcelona Children's Hospital.

Participants: A total of 50 clinical cases were diagnosed by 78 volunteer doctors (the medical diagnostic team) with varying levels of experience, each reviewing 3 clinical cases.

Interventions: Each clinician listed up to five diagnoses per clinical case note. The same was done on the DxGPT web platform, obtaining its Top-5 diagnostic proposals. To evaluate DxGPT's variability, each note was queried three times.

Main Outcome(s) and Measure(s): The study focused mainly on comparing diagnostic accuracy, defined as the percentage of cases in which the correct diagnosis appeared among the proposals, between the medical diagnostic team and DxGPT. Other evaluation criteria included qualitative assessments. The medical diagnostic team also completed a survey on their user experience with DxGPT.

Results: Top-5 diagnostic accuracy was 65% for clinicians and 60% for DxGPT, with no significant difference. Accuracies for common diseases were higher (clinicians: 79%, DxGPT: 71%) than for rare diseases (clinicians: 50%, DxGPT: 49%). Accuracy increased similarly in both groups when notes were expanded with complementary information, but the increase was statistically significant only for clinicians (simple 52% vs. expanded 69%; p=0.03). DxGPT's response variability affected fewer than 5% of clinical case notes. A survey of 48 clinicians rated the DxGPT platform 3.9/5 overall, 4.1/5 for usefulness, and 4.5/5 for usability.

Conclusions and Relevance: DxGPT showed diagnostic accuracy similar to that of medical staff at a pediatric hospital, indicating its potential for supporting differential diagnosis in other settings. Clinicians praised its usability and simplicity. These tools could provide new insights for challenging diagnostic cases.
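The Top-5 accuracy metric reported above is simple to state precisely. Below is a minimal Python sketch, assuming exact string matching between proposed and reference diagnoses; in the study itself, matches were presumably judged by clinical review rather than string comparison, and the diagnosis names here are hypothetical, not study data.

```python
def top5_accuracy(cases):
    """Top-5 diagnostic accuracy: the fraction of cases whose correct
    diagnosis appears among the (up to) five proposed diagnoses.

    cases: list of (proposals, correct) pairs, where proposals is an
    ordered list of up to five candidate diagnoses.
    """
    hits = sum(correct in proposals[:5] for proposals, correct in cases)
    return hits / len(cases)

# Toy example with hypothetical diagnoses (not study data):
cases = [
    (["Kawasaki disease", "scarlet fever", "measles"], "Kawasaki disease"),  # hit
    (["asthma", "bronchiolitis", "pneumonia"], "cystic fibrosis"),           # miss
]
print(f"Top-5 accuracy: {top5_accuracy(cases):.0%}")  # -> Top-5 accuracy: 50%
```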
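The Interventions paragraph describes querying DxGPT three times per case note to gauge response variability. The abstract does not describe DxGPT's internals, so the sketch below only illustrates the general pattern against an OpenAI-style chat API: the prompt wording, model name, and output parsing are assumptions for illustration, not the study's actual pipeline.

```python
# A minimal sketch of the repeated-query variability check. DxGPT is
# GPT-4 based, but its prompt and backend are not given in the abstract;
# the prompt, model name, and parsing below are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Given the clinical case note below, list the five most likely "
          "diagnoses, one per line, most likely first.\n\n{note}")

def top5_proposals(note: str) -> list[str]:
    """One query: return up to five diagnoses proposed for a case note."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
    )
    lines = resp.choices[0].message.content.splitlines()
    # Strip list numbering/bullets and drop empty lines.
    return [ln.strip(" -0123456789.") for ln in lines if ln.strip()][:5]

def variability(note: str, runs: int = 3) -> int:
    """Query the same note several times (three in the study) and count
    how many distinct Top-5 lists come back; 1 means fully stable."""
    return len({tuple(top5_proposals(note)) for _ in range(runs)})
```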