Assessment of Pathology Domain-Specific Knowledge of ChatGPT and Comparison to Human Performance.

Andrew Y Wang, Sherman Lin, Christopher Tran, Robert J Homer, Dan Wilsdon, Joanna C Walsh, Emily A Goebel, Irene Sansano, Snehal Sonawane, Vincent Cockenpot, Sanjay Mukhopadhyay, Toros Taskin, Nusrat Zahra, Luca Cima, Orhan Semerci, Birsen Gizem Özamrak, Pallavi Mishra, Naga Sarika Vennavalli, Po-Hsuan Cameron Chen, Matthew J Cecchini
{"title":"Assessment of Pathology Domain-Specific Knowledge of ChatGPT and Comparison to Human Performance.","authors":"Andrew Y Wang, Sherman Lin, Christopher Tran, Robert J Homer, Dan Wilsdon, Joanna C Walsh, Emily A Goebel, Irene Sansano, Snehal Sonawane, Vincent Cockenpot, Sanjay Mukhopadhyay, Toros Taskin, Nusrat Zahra, Luca Cima, Orhan Semerci, Birsen Gizem Özamrak, Pallavi Mishra, Naga Sarika Vennavalli, Po-Hsuan Cameron Chen, Matthew J Cecchini","doi":"10.5858/arpa.2023-0296-OA","DOIUrl":null,"url":null,"abstract":"<p><strong>Context.—: </strong>Artificial intelligence algorithms hold the potential to fundamentally change many aspects of society. Application of these tools, including the publicly available ChatGPT, has demonstrated impressive domain-specific knowledge in many areas, including medicine.</p><p><strong>Objectives.—: </strong>To understand the level of pathology domain-specific knowledge for ChatGPT using different underlying large language models, GPT-3.5 and the updated GPT-4.</p><p><strong>Design.—: </strong>An international group of pathologists (n = 15) was recruited to generate pathology-specific questions at a similar level to those that could be seen on licensing (board) examinations. The questions (n = 15) were answered by GPT-3.5, GPT-4, and a staff pathologist that recently passed their Canadian pathology licensing exams. Participants were instructed to score answers on a 5-point scale and to predict which answer was written by ChatGPT.</p><p><strong>Results.—: </strong>GPT-3.5 performed at a similar level to the staff pathologist, while GPT-4 outperformed both. The overall score for both GPT-3.5 and GPT-4 was within the range of meeting expectations for a trainee writing licensing examinations. In all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5.</p><p><strong>Conclusions.—: </strong>By demonstrating the ability of ChatGPT to answer pathology-specific questions at a level similar to (GPT-3.5) or exceeding (GPT-4) a trained pathologist, this study highlights the potential of large language models to be transformative in this space. In the future, more advanced iterations of these algorithms with increased domain-specific knowledge may have the potential to assist pathologists and enhance pathology resident training.</p>","PeriodicalId":93883,"journal":{"name":"Archives of pathology & laboratory medicine","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Archives of pathology & laboratory medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5858/arpa.2023-0296-OA","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Context.—: Artificial intelligence algorithms hold the potential to fundamentally change many aspects of society. Application of these tools, including the publicly available ChatGPT, has demonstrated impressive domain-specific knowledge in many areas, including medicine.

Objectives.—: To assess the level of pathology domain-specific knowledge of ChatGPT using two different underlying large language models, GPT-3.5 and the updated GPT-4.

Design.—: An international group of pathologists (n = 15) was recruited to generate pathology-specific questions at a level similar to those seen on licensing (board) examinations. The questions (n = 15) were answered by GPT-3.5, GPT-4, and a staff pathologist who had recently passed their Canadian pathology licensing examinations. Participants were instructed to score answers on a 5-point scale and to predict which answer was written by ChatGPT.
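
As a rough illustration of how model answers to board-style questions could be collected programmatically, a minimal sketch follows. The study itself used the ChatGPT interface, so the OpenAI Python client, the model identifiers, and the prompt wording below are assumptions for illustration only, not the authors' protocol.

```python
# Minimal sketch: collecting answers to pathology questions from two GPT models
# for later blinded scoring alongside a staff pathologist's written answers.
# Model names and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTIONS = [
    "Describe the histologic features that distinguish ... (board-style question)",
    # ... the remaining questions
]

def ask(model: str, question: str) -> str:
    """Pose one board-style question to the given model and return its answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer as a pathologist writing a licensing examination."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

answers = {
    model: [ask(model, q) for q in QUESTIONS]
    for model in ("gpt-3.5-turbo", "gpt-4")
}
```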

Results.—: GPT-3.5 performed at a level similar to the staff pathologist, while GPT-4 outperformed both. The overall score for both GPT-3.5 and GPT-4 was within the range of meeting expectations for a trainee writing licensing examinations. For all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5.
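
For clarity on how an "overall score" can be derived from 5-point reviewer ratings, here is a minimal sketch; the ratings shown are invented placeholders, not the study's data, and the aggregation (a simple pooled mean per answer source) is an assumption rather than the authors' stated analysis.

```python
# Minimal sketch: pooling 5-point reviewer ratings by answer source and
# reporting the mean. All numbers below are made up for illustration.
from statistics import mean

# ratings[source] = scores pooled across reviewers and questions
ratings = {
    "gpt-3.5": [4, 3, 5, 4],
    "gpt-4": [5, 4, 5, 5],
    "staff pathologist": [4, 4, 3, 5],
}

for source, scores in ratings.items():
    print(f"{source}: mean score {mean(scores):.2f} / 5")
```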

Conclusions.—: By demonstrating the ability of ChatGPT to answer pathology-specific questions at a level similar to (GPT-3.5) or exceeding (GPT-4) a trained pathologist, this study highlights the potential of large language models to be transformative in this space. In the future, more advanced iterations of these algorithms with increased domain-specific knowledge may have the potential to assist pathologists and enhance pathology resident training.
