ChatGPT for Pathology Reports in Cutaneous Lymphoma: Accuracy and Readability in Cutaneous Lymphoma

Jennifer Chen, Josiah Hanson, Oliver H. Chang, Michi M. Shinohara

JEADV Clinical Practice 4(2): 561–563. DOI: 10.1002/jvc2.602
Artificial intelligence (AI) has become more powerful and more integrated into our everyday lives, including in health care. ChatGPT has been proposed as a virtual “health assistant” that provides health information in “succinct, clear overviews in layman's terms” [1]. While AI can potentially fill gaps in health care, there are concerns about the accuracy of the information it produces. With limited resources and potentially difficult-to-understand information, patients with rare diseases such as primary cutaneous lymphoma may turn to ChatGPT for answers. In this study, we assessed the accuracy and readability of ChatGPT's interpretations of cutaneous lymphoma pathology reports.
We randomly selected 41 cutaneous lymphoma pathology reports from patients at the University of Washington and Fred Hutch Cancer Center. We provided ChatGPT-3.5 with the final diagnoses, comments, and addenda from the deidentified reports, together with the prompt “Interpret this pathology diagnosis for me in layman's terms.” The ChatGPT interpretations were evaluated by three dermatopathologists, and errors were classified as clinically significant or non-clinically significant based on whether the error could potentially change diagnosis or management.
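Each report was interpreted with the same fixed prompt. As an illustration only, a minimal sketch of how such a query could be scripted against the OpenAI chat API is shown below; it assumes the gpt-3.5-turbo model and a hypothetical placeholder report, whereas the study itself submitted deidentified report text through the ChatGPT interface rather than code.

```python
# Illustrative sketch only: the study pasted deidentified report text into ChatGPT-3.5;
# this shows how the same fixed prompt could be sent programmatically via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Interpret this pathology diagnosis for me in layman's terms."

def interpret_report(report_text: str) -> str:
    """Return a lay-language interpretation of a deidentified pathology report."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{report_text}"}],
    )
    return response.choices[0].message.content

# Hypothetical placeholder text, not taken from the study's reports:
example_report = "Final diagnosis: Primary cutaneous follicle centre lymphoma. See comment."
print(interpret_report(example_report))
```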
Out of the 41 evaluated reports, we found seven clinically significant errors and 20 non-clinically significant errors (Table 1). Examples of clinically significant errors are shown in Table 1b.
Figure 1 shows average readability scores and grade levels for the original pathology reports and the ChatGPT interpretations. On average, original pathology reports had a Flesch reading ease score of 16.6 ± 11.0, corresponding to a grade level of 14.9 ± 2.8 (approximately a college graduate reading level). ChatGPT interpretations had an average Flesch reading ease score of 43.5 ± 11.8, corresponding to a grade level of 12.0 ± 1.8 (approximately a high school graduate reading level). The mean difference in Flesch reading ease scores between original pathology reports and ChatGPT interpretations was 26.9 [23.1–30.7] (p < 0.01), corresponding to a decrease in grade level of 2.8 [2.0–3.7] (p < 0.01).
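Both metrics are simple functions of average sentence length and syllables per word. The sketch below shows roughly how the paired report-versus-interpretation comparison could be reproduced; the study does not state which software or statistical test was used, the syllable counter is a crude heuristic rather than a validated readability tool, and the report/interpretation pairs are hypothetical placeholders.

```python
# Rough sketch: Flesch reading ease and Flesch-Kincaid grade level for each text,
# then a paired comparison across (report, interpretation) pairs.
import re
from scipy import stats

def count_syllables(word: str) -> int:
    # Crude approximation: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / max(1, len(words))
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

# Hypothetical placeholder pairs (original report text, ChatGPT interpretation):
pairs = [
    ("Atypical epidermotropic T-cell infiltrate consistent with mycosis fungoides, patch stage.",
     "This skin biopsy shows an early stage of a slow-growing skin lymphoma called mycosis fungoides."),
    ("Dense dermal infiltrate of CD30-positive large atypical lymphocytes; see comment.",
     "The sample contains a group of abnormal immune cells that carry a marker called CD30."),
]
report_ease = [flesch_scores(report)[0] for report, _ in pairs]
interp_ease = [flesch_scores(interp)[0] for _, interp in pairs]
t_stat, p_value = stats.ttest_rel(interp_ease, report_ease)  # paired t-test on the differences
mean_diff = sum(i - r for i, r in zip(interp_ease, report_ease)) / len(pairs)
print(f"mean difference in reading ease = {mean_diff:.1f}, p = {p_value:.3f}")
```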
We found that ChatGPT's interpretation of cutaneous lymphoma pathology reports generated errors that could affect patients' understanding of their diagnosis or management if patients relied on the ChatGPT interpretation alone. For example, a pathology report with the original diagnosis of “primary cutaneous follicular lymphoma” was interpreted by ChatGPT as “a type of cancer called follicular lymphoma that usually starts in the lymph nodes,” implying a systemic rather than a primary cutaneous lymphoma. Such an interpretation could cause additional anxiety or stress for the patient and could lead to inappropriate treatment if relied upon by clinicians. The clinically significant error rate in our study is higher than previously reported for ChatGPT-4's interpretation of pathology reports [2]. Possible explanations for this discrepancy are that the previous study included more commonly known conditions, giving ChatGPT more available information to draw from, and that it used ChatGPT-4, the latest version of ChatGPT to date. Had we used ChatGPT-4, our rate of “inclusion of information not originally in the report” errors might have been lower; according to OpenAI, ChatGPT-4 scored “40% higher on tests intended to measure hallucination or fabricating facts” [3]. We chose ChatGPT-3.5 for this study because it is free and more accessible than ChatGPT-4 [4]; this raises the possibility that patients who cannot afford more recent versions of ChatGPT may be at greater risk of misinformation. We are also aware that patients have access to other AI tools to interpret their pathology reports, such as Google Bard. While we did not study the accuracy of AI models other than ChatGPT, previous assessments showed that Google Bard was more prone to hallucination errors and less accurate than ChatGPT-4 in interpreting pathology reports from multiple organ systems [2]. Future studies assessing discrepancies between AI models could be useful for physicians when counselling patients about AI use.
ChatGPT's interpretations were significantly easier to read, and all three of our dermatopathologists agreed that ChatGPT reorganised the information in a more digestible way. However, even these easier-to-read ChatGPT interpretations were still too complex for the average US reading level, which may lead to disparities in access to information generated with AI [5-7].
In summary, while ChatGPT has the potential to increase accessibility of medical information for patients, its use in interpreting complex medical data, such as pathology reports, presents significant risks due to potential errors. It is essential that healthcare providers remain aware of these limitations and continue to validate AI-generated information before it is relied upon by patients or in the clinical setting.
Jennifer Chen: conceptualisation, data curation, formal analysis, investigation, methodology, writing–original draft, writing–review and editing. Josiah Hanson: formal analysis, writing–review and editing. Oliver H. Chang: formal analysis, writing–review and editing. Michi M. Shinohara: conceptualisation, data curation, formal analysis, supervision, writing–review and editing.
No patient-identifying details are presented in this study. The study protocol was in accordance with the ethical standards of the University of Washington IRB and with the Helsinki Declaration of 1975, as revised in 1983.

The authors declare no conflicts of interest.