ChatGPT for Pathology Reports in Cutaneous Lymphoma: Accuracy and Readability

Jennifer Chen, Josiah Hanson, Oliver H. Chang, Michi M. Shinohara
{"title":"ChatGPT for Pathology Reports in Cutaneous Lymphoma: Accuracy and Readability in Cutaneous Lymphoma","authors":"Jennifer Chen,&nbsp;Josiah Hanson,&nbsp;Oliver H. Chang,&nbsp;Michi M. Shinohara","doi":"10.1002/jvc2.602","DOIUrl":null,"url":null,"abstract":"<p>Artificial Intelligence (AI) has become more powerful and more integrated in our everyday lives, including in health care. ChatGPT has been proposed as a tool to act as a virtual “health assistant”, by giving health information in “succinct, clear overviews in layman's terms” [<span>1</span>]. While AI can potentially fill in gaps in health care, there are concerns about the accuracy of information output. With limited resources and potentially difficult-to-understand information, patients with rare diseases such as primary cutaneous lymphoma may turn to ChatGPT for answers. In this study, we assessed the accuracy and readability of ChatGPT's interpretation of cutaneous lymphoma pathology reports.</p><p>We randomly chose 41 cutaneous lymphoma pathology reports from patients at the University of Washington and Fred Hutch Cancer Center. We provided ChatGPT-3.5 the final diagnoses, comments, and addendums from deidentified pathology reports with the command “Interpret this pathology diagnosis for me in layman's terms.” ChatGPT interpretations were evaluated by three dermatopathologists, and errors were classified as clinically significant or non-clinically significant based on if the error could potentially change diagnosis or management.</p><p>Out of the 41 evaluated reports, we found seven clinically significant errors and 20 non-clinically significant errors (Table 1). Examples of clinically significant errors are shown in Table 1b.</p><p>Figure 1 shows average readability scores and grade levels for original pathology reports and ChatGPT interpretations. On average, original pathology reports had a Flesch reading ease score of 16.6 ± 11.0, corresponding to a grade level of 14.9 ± 2.8, approximately a college graduate. ChatGPT interpretations had an average Flesch reading ease score of 43.5 ± 11.8, corresponding to a grade level of 12.0 ± 1.8, approximately high school graduate. The mean difference in Flesch reading ease scores between original pathology reports and ChatGPT interpretations was 26.9 [23.1–30.7] (<i>p</i> &lt; 0.01), corresponding to a decrease in grade level by 2.8 [2.0–3.7] (<i>p</i> &lt; 0.01).</p><p>We found that ChatGPT interpretation of pathology reports of cutaneous lymphomas generated errors that could impact patients' understanding of their diagnosis or management if patients relied on the ChatGPT interpretation alone. For example, a pathology report with the original diagnosis of “primary cutaneous follicular lymphoma” was interpreted by ChatGPT as “a type of cancer called follicular lymphoma that usually starts in the lymph nodes,” implying that the diagnosis is systemic rather than primary cutaneous lymphoma. Relying on this interpretation could potentially lead to additional anxiety or stress on the part of the patient and/or inappropriate treatment if relied upon by clinicians. The clinically significant error rate in our study is higher than previously reported for ChatGPT-4's interpretation of pathology reports [<span>2</span>]. Possible explanations for the discrepancy in error rates could be that the previous study included more commonly known conditions, so ChatGPT had more available information to draw from. 
The previous study was conducted using ChatGPT-4, which is the latest version of ChatGPT to date. Using GPT-4 it is possible that our rate of “inclusion of information not originally in the report” error rate would be lower; according to OpenAI, ChatGPT-4 scored “40% higher on tests intended to measure hallucination or fabricating facts.” [<span>3</span>] We chose to use ChatGPT-3.5 in this study because it is free and more accessible than ChatGPT-4 [<span>4</span>]; this raises the possibility that patients who can't afford more recent version of ChatGPT might be at more risk of misinformation. We are also aware that patients have access to other forms of AI to interpret their pathology reports, such as Google Bard. While we did not study the accuracy of AI models outside of ChatGPT, previous assessments of AI accuracy showed that Google Bard was more prone to hallucination errors and less accurate than ChatGPT-4 in interpreting pathology reports from multiple organ systems [<span>2</span>]. Future studies assessing discrepancies between AI algorithm models could be useful for physicians when counselling patients about AI use.</p><p>ChatGPT's interpretations were significantly easier to read, and all three of our dermatopathologists agreed that the ChatGPT reorganised information in a more digestible way. However, even these easier to read ChatGPT interpretations were still too complex for the average US reading level, leading to disparities in access to information generated with AI [<span>5-7</span>].</p><p>In summary, while ChatGPT has the potential to increase accessibility of medical information for patients, its use in interpreting complex medical data, such as pathology reports, presents significant risks due to potential errors. It is essential that healthcare providers remain aware of these limitations and continue to validate AI-generated information before it is relied upon by patients or in the clinical setting.</p><p><b>Jennifer Chen:</b> conceptualisation, data curation, formal analysis, investigation, methodology, writing–original draft, writing–review and editing. <b>Josiah Hanson:</b> formal analysis, writing–review and editing. <b>Oliver H. Chang:</b> formal analysis, writing–review and editing. <b>Michi M. Shinohara:</b> conceptualisation, data curation, formal analysis, supervision, writing–review and editing.</p><p>No patient identifying details are presented in this study. Study protocol was in accordance with the ethical standards of the University of Washington IRB and with the Helsinki Declaration of 1975, as revised in 1983.</p><p>The authors declare no conflicts of interest.</p>","PeriodicalId":94325,"journal":{"name":"JEADV clinical practice","volume":"4 2","pages":"561-563"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/jvc2.602","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JEADV clinical practice","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/jvc2.602","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Artificial Intelligence (AI) has become more powerful and more integrated into our everyday lives, including in health care. ChatGPT has been proposed as a tool to act as a virtual “health assistant” by giving health information in “succinct, clear overviews in layman's terms” [1]. While AI can potentially fill gaps in health care, there are concerns about the accuracy of its output. With limited resources and potentially difficult-to-understand information, patients with rare diseases such as primary cutaneous lymphoma may turn to ChatGPT for answers. In this study, we assessed the accuracy and readability of ChatGPT's interpretation of cutaneous lymphoma pathology reports.

We randomly chose 41 cutaneous lymphoma pathology reports from patients at the University of Washington and Fred Hutch Cancer Center. We provided ChatGPT-3.5 with the final diagnoses, comments, and addenda from deidentified pathology reports, together with the command “Interpret this pathology diagnosis for me in layman's terms.” ChatGPT interpretations were evaluated by three dermatopathologists, and errors were classified as clinically significant or non-clinically significant based on whether the error could potentially change diagnosis or management.
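The study does not state how the reports were submitted to ChatGPT-3.5 (e.g., web interface versus API). Purely as an illustration, the sketch below shows how the same prompt could be sent programmatically, assuming the OpenAI Python client and the gpt-3.5-turbo model; the function and variable names are hypothetical and are not part of the study's workflow.

```python
# Illustrative sketch only (assumed workflow, not the authors' method):
# submit a deidentified pathology report to GPT-3.5 with the study's prompt.
# Assumes the OpenAI Python client (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def interpret_report(deidentified_report: str) -> str:
    """Return a lay-language interpretation of the report text."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # ChatGPT-3.5, the model used in the study
        messages=[{
            "role": "user",
            "content": ("Interpret this pathology diagnosis for me in layman's terms.\n\n"
                        + deidentified_report),
        }],
    )
    return response.choices[0].message.content
```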

Out of the 41 evaluated reports, we found seven clinically significant errors and 20 non-clinically significant errors (Table 1). Examples of clinically significant errors are shown in Table 1b.

Figure 1 shows average readability scores and grade levels for original pathology reports and ChatGPT interpretations. On average, original pathology reports had a Flesch reading ease score of 16.6 ± 11.0, corresponding to a grade level of 14.9 ± 2.8, approximately a college graduate. ChatGPT interpretations had an average Flesch reading ease score of 43.5 ± 11.8, corresponding to a grade level of 12.0 ± 1.8, approximately a high school graduate. The mean difference in Flesch reading ease scores between original pathology reports and ChatGPT interpretations was 26.9 [23.1–30.7] (p < 0.01), corresponding to a decrease in grade level of 2.8 [2.0–3.7] (p < 0.01).
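The readability metrics above follow the standard Flesch formulas (reading ease = 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word); higher scores are easier to read). As a hedged illustration of how such scores and the paired comparison could be computed, the sketch below uses the textstat package and a paired t-test from scipy; the exact statistical procedure used in the study is not stated, so this is an assumption rather than the authors' analysis code.

```python
# Illustrative sketch (assumed analysis, not the authors' stated code):
# compute Flesch reading ease / Flesch-Kincaid grade level for each report
# and compare original reports with ChatGPT interpretations pairwise.
import textstat
from scipy import stats

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (reading ease, grade level) for a block of text."""
    return textstat.flesch_reading_ease(text), textstat.flesch_kincaid_grade(text)

def compare_readability(originals: list[str], interpretations: list[str]) -> None:
    ease_orig = [flesch_scores(t)[0] for t in originals]
    ease_gpt = [flesch_scores(t)[0] for t in interpretations]
    mean_diff = sum(g - o for g, o in zip(ease_gpt, ease_orig)) / len(ease_orig)
    # Paired t-test across the report/interpretation pairs
    t_stat, p_value = stats.ttest_rel(ease_gpt, ease_orig)
    print(f"Mean reading-ease improvement: {mean_diff:.1f} (p = {p_value:.3g})")
```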

We found that ChatGPT interpretation of pathology reports of cutaneous lymphomas generated errors that could impact patients' understanding of their diagnosis or management if patients relied on the ChatGPT interpretation alone. For example, a pathology report with the original diagnosis of “primary cutaneous follicular lymphoma” was interpreted by ChatGPT as “a type of cancer called follicular lymphoma that usually starts in the lymph nodes,” implying that the diagnosis is systemic rather than primary cutaneous lymphoma. Relying on this interpretation could lead to additional anxiety or stress on the part of the patient and/or inappropriate treatment if relied upon by clinicians. The clinically significant error rate in our study is higher than previously reported for ChatGPT-4's interpretation of pathology reports [2]. One possible explanation for the discrepancy in error rates is that the previous study included more commonly known conditions, so ChatGPT had more available information to draw from. Another is that the previous study was conducted using ChatGPT-4, the latest version of ChatGPT to date. Using ChatGPT-4, it is possible that our rate of “inclusion of information not originally in the report” errors would be lower; according to OpenAI, ChatGPT-4 scored “40% higher on tests intended to measure hallucination or fabricating facts” [3]. We chose to use ChatGPT-3.5 in this study because it is free and more accessible than ChatGPT-4 [4]; this raises the possibility that patients who cannot afford a more recent version of ChatGPT might be at greater risk of misinformation. We are also aware that patients have access to other forms of AI to interpret their pathology reports, such as Google Bard. While we did not study the accuracy of AI models outside of ChatGPT, previous assessments of AI accuracy showed that Google Bard was more prone to hallucination errors and less accurate than ChatGPT-4 in interpreting pathology reports from multiple organ systems [2]. Future studies assessing discrepancies between AI models could be useful for physicians when counselling patients about AI use.

ChatGPT's interpretations were significantly easier to read, and all three of our dermatopathologists agreed that ChatGPT reorganised information in a more digestible way. However, even these easier-to-read ChatGPT interpretations were still too complex for the average US reading level, which could lead to disparities in access to information generated with AI [5-7].

In summary, while ChatGPT has the potential to increase accessibility of medical information for patients, its use in interpreting complex medical data, such as pathology reports, presents significant risks due to potential errors. It is essential that healthcare providers remain aware of these limitations and continue to validate AI-generated information before it is relied upon by patients or in the clinical setting.

Jennifer Chen: conceptualisation, data curation, formal analysis, investigation, methodology, writing–original draft, writing–review and editing. Josiah Hanson: formal analysis, writing–review and editing. Oliver H. Chang: formal analysis, writing–review and editing. Michi M. Shinohara: conceptualisation, data curation, formal analysis, supervision, writing–review and editing.

No patient-identifying details are presented in this study. The study protocol was in accordance with the ethical standards of the University of Washington IRB and with the Helsinki Declaration of 1975, as revised in 1983.

The authors declare no conflicts of interest.
