Analysis of ChatGPT Responses to Ophthalmic Cases: Can ChatGPT Think like an Ophthalmologist?

Ophthalmology Science · IF 3.2 · Q1 (Ophthalmology)
{"title":"分析 ChatGPT 对眼科病例的反应:ChatGPT 能否像眼科医生一样思考?","authors":"","doi":"10.1016/j.xops.2024.100600","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><p>Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating its ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessment and plans generated by ChatGPT and (2) evaluate ophthalmologists’ abilities to distinguish between responses generated by clinicians versus ChatGPT.</p></div><div><h3>Design</h3><p>Cross-sectional mixed-methods study.</p></div><div><h3>Subjects</h3><p>Sixteen ophthalmologists from a single academic center, of which 10 were board-eligible and 6 were board-certified, were recruited to participate in this study.</p></div><div><h3>Methods</h3><p>Prompt engineering was used to ensure ChatGPT output discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases where ChatGPT accurately identified the primary diagnoses were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists to identify the author of the discussions. Response confidence was assessed using a 5-point Likert scale score, and subjective feedback was manually reviewed.</p></div><div><h3>Main Outcome Measures</h3><p>Accuracy of ophthalmologist identification of discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions.</p></div><div><h3>Results</h3><p>Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison due to hallucinations or fabrications of nonuser-provided data. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to have more generic responses, irrelevant information, hallucinated more frequently, and had distinct syntactic patterns (all <em>P</em> &lt; 0.01).</p></div><div><h3>Conclusions</h3><p>Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. 
While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment.</p></div><div><h3>Financial Disclosures</h3><p>The author(s) have no proprietary or commercial interest in any materials discussed in this article.</p></div>","PeriodicalId":74363,"journal":{"name":"Ophthalmology science","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666914524001362/pdfft?md5=1fc56cec0e121016c01c38686515b525&pid=1-s2.0-S2666914524001362-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Analysis of ChatGPT Responses to Ophthalmic Cases: Can ChatGPT Think like an Ophthalmologist?\",\"authors\":\"\",\"doi\":\"10.1016/j.xops.2024.100600\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objective</h3><p>Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating its ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessment and plans generated by ChatGPT and (2) evaluate ophthalmologists’ abilities to distinguish between responses generated by clinicians versus ChatGPT.</p></div><div><h3>Design</h3><p>Cross-sectional mixed-methods study.</p></div><div><h3>Subjects</h3><p>Sixteen ophthalmologists from a single academic center, of which 10 were board-eligible and 6 were board-certified, were recruited to participate in this study.</p></div><div><h3>Methods</h3><p>Prompt engineering was used to ensure ChatGPT output discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases where ChatGPT accurately identified the primary diagnoses were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists to identify the author of the discussions. Response confidence was assessed using a 5-point Likert scale score, and subjective feedback was manually reviewed.</p></div><div><h3>Main Outcome Measures</h3><p>Accuracy of ophthalmologist identification of discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions.</p></div><div><h3>Results</h3><p>Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison due to hallucinations or fabrications of nonuser-provided data. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to have more generic responses, irrelevant information, hallucinated more frequently, and had distinct syntactic patterns (all <em>P</em> &lt; 0.01).</p></div><div><h3>Conclusions</h3><p>Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. 
While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment.</p></div><div><h3>Financial Disclosures</h3><p>The author(s) have no proprietary or commercial interest in any materials discussed in this article.</p></div>\",\"PeriodicalId\":74363,\"journal\":{\"name\":\"Ophthalmology science\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2024-08-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666914524001362/pdfft?md5=1fc56cec0e121016c01c38686515b525&pid=1-s2.0-S2666914524001362-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ophthalmology science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666914524001362\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmology science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666914524001362","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Objective

Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating their ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of the assessments and plans generated by ChatGPT and (2) evaluate ophthalmologists' abilities to distinguish between responses generated by clinicians and those generated by ChatGPT.

Design

Cross-sectional mixed-methods study.

Subjects

Sixteen ophthalmologists from a single academic center, of whom 10 were board-eligible and 6 were board-certified, were recruited to participate in this study.

Methods

Prompt engineering was used to ensure that ChatGPT produced discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases in which ChatGPT accurately identified the primary diagnosis were included and then paired with the corresponding human-written discussions. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists, who were asked to identify the author of each discussion. Response confidence was assessed using a 5-point Likert scale, and subjective feedback was reviewed manually.
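
The abstract does not include the study's actual prompt, model configuration, or tooling; the sketch below is only a minimal illustration of how style-conditioned discussion generation could be set up with the OpenAI Python client, with the system prompt, model name, and case text as hypothetical placeholders.

```python
# Minimal sketch of style-conditioned case-discussion generation.
# Assumptions: the system prompt, model name, and case text are
# illustrative placeholders, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STYLE_PROMPT = (
    "You are an academic ophthalmologist writing a case discussion in the "
    "style of the Medical College of Wisconsin Ophthalmic Case Studies: "
    "identify the primary diagnosis, then provide an assessment and plan."
)

def generate_discussion(case_presentation: str, model: str = "gpt-4") -> str:
    """Return a model-generated assessment and discussion for one case."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STYLE_PROMPT},
            {"role": "user", "content": case_presentation},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```

Each generated discussion would then be paired with the human-written discussion for the same case and presented to reviewers with authorship masked.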

Main Outcome Measures

Accuracy of ophthalmologists' identification of the discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions.

Results

Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison because of hallucination or fabrication of data that had not been provided by the user. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to contain more generic responses and irrelevant information, hallucinated more frequently, and had distinct syntactic patterns (all P < 0.01).
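
As a worked illustration of how these summary figures could be computed (not the authors' analysis code), the sketch below recomputes the diagnosis-identification rate and compares per-reviewer identification accuracy between the two certification groups; the per-reviewer values are hypothetical placeholders, and the Mann-Whitney U test is an assumed choice, since the abstract does not name the statistical test used.

```python
# Worked illustration of the reported summary statistics.
# Assumptions: per-reviewer accuracies are hypothetical placeholders, and
# the Mann-Whitney U test is an assumed choice of group comparison.
import numpy as np
from scipy.stats import mannwhitneyu

# Primary-diagnosis accuracy: 15 of 17 cases.
print(f"Diagnosis accuracy: {15 / 17:.1%}")  # -> 88.2%

# Hypothetical per-reviewer author-identification accuracy on the 13
# paired cases, split by certification status (6 certified, 10 eligible).
board_certified = np.array([0.85, 0.92, 0.77, 0.62, 0.92, 0.85])
board_eligible = np.array([0.77, 0.69, 0.54, 1.00, 0.85, 0.92,
                           0.46, 0.77, 0.62, 0.85])

all_reviewers = np.concatenate([board_certified, board_eligible])
print(f"Mean identification accuracy: {all_reviewers.mean():.1%} "
      f"± {all_reviewers.std(ddof=1):.1%}")

# Two-sided test for a difference between certification groups.
stat, p = mannwhitneyu(board_certified, board_eligible, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")
```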

Conclusions

Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment.

Financial Disclosures

The author(s) have no proprietary or commercial interest in any materials discussed in this article.
