ChatGPT 4.0's efficacy in the self-diagnosis of non-traumatic hand conditions.
Krishna D Unadkat, Isra Abdulwadood, Annika N Hiredesai, Carina P Howlett, Laura E Geldmaker, Shelley S Noland
Journal of Hand and Microsurgery, 17(3):100217 (2025). DOI: 10.1016/j.jham.2025.100217
Abstract
Background: With advancements in artificial intelligence, patients increasingly turn to generative AI models like ChatGPT for medical advice. This study explores the utility of ChatGPT 4.0 (GPT-4.0), the most recent version of ChatGPT, as an interim diagnostician for common hand conditions. Secondarily, the study evaluates the terminology GPT-4.0 associates with each condition by assessing its ability to generate condition-specific questions from a patient's perspective.
Methods: Five common hand conditions were identified: trigger finger (TF), Dupuytren's contracture (DC), carpal tunnel syndrome (CTS), de Quervain's tenosynovitis (DQT), and thumb carpometacarpal osteoarthritis (CMC). GPT-4.0 was queried with author-generated questions, and the frequency of correct diagnoses, differential diagnoses, and recommendations was recorded. Chi-squared and pairwise Fisher's exact tests were used to compare response accuracy between conditions. GPT-4.0 was also prompted to produce its own questions, and common terms in its responses were recorded.
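The statistical comparison described above can be sketched as follows. This is a minimal illustration, not the study's analysis code; the per-condition counts are assumptions chosen to be consistent with the reported figures (25 queries per condition, CMC at 15/25 correct, the other conditions above 95%), using `scipy.stats` for the omnibus chi-squared test and the pairwise Fisher's exact tests.

```python
from itertools import combinations

from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical (correct, incorrect) counts per condition.
# 25 queries per condition is an assumption for illustration;
# CMC's 15/25 (60%) matches the reported accuracy.
results = {
    "CTS": (25, 0),
    "TF": (24, 1),
    "DQT": (24, 1),
    "DC": (24, 1),
    "CMC": (15, 10),
}

# Omnibus chi-squared test of accuracy across all five conditions.
table = [list(counts) for counts in results.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4g}")

# Pairwise Fisher's exact tests between conditions.
for a, b in combinations(results, 2):
    _, p_pair = fisher_exact([results[a], results[b]])
    print(f"{a} vs {b}: p = {p_pair:.4g}")
```

With counts like these, the omnibus test is driven almost entirely by the CMC row, and the pairwise tests isolate which condition pairs differ, which is the role the abstract attributes to the Fisher's exact comparisons.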
Results: GPT-4.0's diagnostic accuracy differed significantly between conditions (p < 0.005). While GPT-4.0 diagnosed CTS, TF, DQT, and DC with >95% accuracy, only 60% (n = 15) of CMC queries were correctly diagnosed. Additionally, there were significant differences in the provision of differential diagnoses (p < 0.005), diagnostic tests (p < 0.005), and risk factors (p < 0.05). GPT-4.0 recommended visiting a healthcare provider for 97% (n = 121) of the questions. Analysis of ChatGPT-generated questions showed that four of the ten most-used terms were shared between DQT and CMC.
Conclusions: The results suggest that GPT-4.0 has potential preliminary diagnostic utility. Future studies should further investigate factors that improve or worsen AI's diagnostic power and consider the implications of patient utilization.