{"title":"Bias Sensitivity in Diagnostic Decision-Making: Comparing ChatGPT with Residents.","authors":"Henk G Schmidt, Jerome I Rotgans, Silvia Mamede","doi":"10.1007/s11606-024-09177-9","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Diagnostic errors, often due to biases in clinical reasoning, significantly affect patient care. While artificial intelligence chatbots like ChatGPT could help mitigate such biases, their potential susceptibility to biases is unknown.</p><p><strong>Methods: </strong>This study evaluated diagnostic accuracy of ChatGPT against the performance of 265 medical residents in five previously published experiments aimed at inducing bias. The residents worked in several major teaching hospitals in the Netherlands. The biases studied were case-intrinsic (presence of salient distracting findings in the patient history, effects of disruptive patient behaviors) and situational (prior availability of a look-alike patient). ChatGPT's accuracy in identifying the most-likely diagnosis was measured.</p><p><strong>Results: </strong>Diagnostic accuracy of residents and ChatGPT was equivalent. For clinical cases involving case-intrinsic bias, both ChatGPT and the residents exhibited a decline in diagnostic accuracy. Residents' accuracy decreased on average 12%, while the accuracy of ChatGPT 4.0 decreased 21%. Accuracy of ChatGPT 3.5 decreased 9%. These findings suggest that, like human diagnosticians, ChatGPT is sensitive to bias when the biasing information is part of the patient history. When the biasing information was extrinsic to the case in the form of the prior availability of a look-alike case, residents' accuracy decreased by 15%. By contrast, ChatGPT's performance was not affected by the biasing information. Chi-square goodness-of-fit tests corroborated these outcomes.</p><p><strong>Conclusions: </strong>It seems that, while ChatGPT is not sensitive to bias when biasing information is situational, it is sensitive to bias when the biasing information is part of the patient's disease history. Its utility in diagnostic support has potential, but caution is advised. Future research should enhance AI's bias detection and mitigation to make it truly useful for diagnostic support.</p>","PeriodicalId":15860,"journal":{"name":"Journal of General Internal Medicine","volume":" ","pages":"790-795"},"PeriodicalIF":4.3000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11914423/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of General Internal Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s11606-024-09177-9","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/11/7 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Abstract
Background: Diagnostic errors, often due to biases in clinical reasoning, significantly affect patient care. While artificial intelligence chatbots like ChatGPT could help mitigate such biases, whether they are themselves susceptible to bias is unknown.
Methods: This study compared the diagnostic accuracy of ChatGPT with the performance of 265 medical residents in five previously published experiments designed to induce bias. The residents worked in several major teaching hospitals in the Netherlands. The biases studied were case-intrinsic (presence of salient distracting findings in the patient history, effects of disruptive patient behaviors) and situational (prior availability of a look-alike patient). ChatGPT's accuracy in identifying the most likely diagnosis was measured.
Results: Diagnostic accuracy of residents and ChatGPT was equivalent. For clinical cases involving case-intrinsic bias, both ChatGPT and the residents exhibited a decline in diagnostic accuracy. Residents' accuracy decreased by 12% on average, while the accuracy of ChatGPT 4.0 decreased by 21% and that of ChatGPT 3.5 by 9%. These findings suggest that, like human diagnosticians, ChatGPT is sensitive to bias when the biasing information is part of the patient history. When the biasing information was extrinsic to the case, in the form of the prior availability of a look-alike case, residents' accuracy decreased by 15%; by contrast, ChatGPT's performance was not affected. Chi-square goodness-of-fit tests corroborated these outcomes.
Conclusions: It seems that, while ChatGPT is not sensitive to bias when the biasing information is situational, it is sensitive to bias when the biasing information is part of the patient's disease history. It shows potential for diagnostic support, but caution is advised. Future research should enhance AI's bias detection and mitigation to make it truly useful for diagnostic support.
Journal Introduction
The Journal of General Internal Medicine is the official journal of the Society of General Internal Medicine. It promotes improved patient care, research, and education in primary care, general internal medicine, and hospital medicine. Its articles focus on topics such as clinical medicine, epidemiology, prevention, health care delivery, curriculum development, and numerous other non-traditional themes, in addition to classic clinical research on problems in internal medicine.