Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study

JMIR Formative Research · 2024-10-01 · doi:10.2196/51383 · IF 2.0 · Q3, Health Care Sciences & Services
Yong K Choi, Shih-Yin Lin, Donna Marie Fick, Richard W Shulman, Sangil Lee, Priyanka Shrestha, Kate Santoso
{"title":"优化 ChatGPT 对谵妄评估结果的解释和报告:探索性研究。","authors":"Yong K Choi, Shih-Yin Lin, Donna Marie Fick, Richard W Shulman, Sangil Lee, Priyanka Shrestha, Kate Santoso","doi":"10.2196/51383","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpretating assessment results strictly following credible, published scoring criteria, have not been thoroughly studied.</p><p><strong>Objective: </strong>This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization.</p><p><strong>Methods: </strong>We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI's processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool's criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards.</p><p><strong>Results: </strong>Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive \"Yes\" or \"No\" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire.</p><p><strong>Conclusions: </strong>Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. 
Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research.</p>","PeriodicalId":14841,"journal":{"name":"JMIR Formative Research","volume":null,"pages":null},"PeriodicalIF":2.0000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11480687/pdf/","citationCount":"0","resultStr":"{\"title\":\"Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study.\",\"authors\":\"Yong K Choi, Shih-Yin Lin, Donna Marie Fick, Richard W Shulman, Sangil Lee, Priyanka Shrestha, Kate Santoso\",\"doi\":\"10.2196/51383\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpretating assessment results strictly following credible, published scoring criteria, have not been thoroughly studied.</p><p><strong>Objective: </strong>This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization.</p><p><strong>Methods: </strong>We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI's processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool's criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards.</p><p><strong>Results: </strong>Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. 
Prompt optimizations included adjusting the scoring methodology to accept only definitive \\\"Yes\\\" or \\\"No\\\" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire.</p><p><strong>Conclusions: </strong>Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research.</p>\",\"PeriodicalId\":14841,\"journal\":{\"name\":\"JMIR Formative Research\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2024-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11480687/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR Formative Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2196/51383\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Formative Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/51383","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general-purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis, without additional training. However, the specific application of ChatGPT to learning and carrying out a series of specialized, context-specific tasks that mimic the workflow of a human assessor, such as administering a standardized assessment questionnaire, entering assessment results in a standardized form, and interpreting assessment results strictly according to credible, published scoring criteria, has not been thoroughly studied.

Objective: This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization.

Methods: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI's processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool's criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards.
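To make the prompt-engineering workflow concrete, the sketch below shows how a structured "assessor" prompt and a clinical vignette might be submitted to the ChatGPT API. This is a minimal illustration under stated assumptions, not the study's actual materials: the prompt wording, the model name, and the helper function are illustrative, and the published paper should be consulted for the exact instructions used.

```python
# Minimal sketch of the prompt-engineering setup described in the Methods.
# Assumptions: the prompt text and model name are illustrative placeholders,
# not the study's materials; requires the official `openai` Python package
# (v1.x) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# A structured prompt telling the model which instrument to apply,
# how to score it, and how to format its output.
SYSTEM_PROMPT = """You are assisting with a delirium assessment.
Apply the Sour Seven Questionnaire to the caregiver vignette provided.
For each item, answer only "Yes" or "No", then report the item scores
and the total score in a table, followed by the recommended action
that corresponds to the total score."""

def assess_vignette(vignette_text: str, model: str = "gpt-4") -> str:
    """Send one clinical vignette to the model and return its assessment."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": vignette_text},
        ],
        temperature=0,  # favor consistent, reproducible scoring
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    example_vignette = (
        "Over the past day the patient has been drowsy, hard to rouse, "
        "and unable to follow simple commands."
    )
    print(assess_vignette(example_vignette))
```

Setting the temperature to 0 in a sketch like this reflects the abstract's emphasis on output consistency, but it is a design choice of the example rather than a documented detail of the study.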

Results: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive "Yes" or "No" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire.
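As a rough illustration of the constraints these optimizations impose, the sketch below validates a tabular "Yes"/"No" response of the kind the revised evaluation prompt requests and maps the total to one of two actions. The table layout, item weights, score threshold, and function names are hypothetical placeholders, not the published Sour Seven scoring rules.

```python
# Illustrative validator for a tabular "Yes"/"No" response of the kind the
# optimized evaluation prompt requests. The item weights and threshold are
# PLACEHOLDERS, not the published Sour Seven scoring rules; substitute the
# official values before any real use.
PLACEHOLDER_WEIGHTS = {  # item name -> points awarded when the answer is "Yes"
    "item_1": 1, "item_2": 1, "item_3": 1, "item_4": 1,
    "item_5": 1, "item_6": 1, "item_7": 1,
}

def parse_table(table_text: str) -> dict:
    """Parse a simple 'item | answer' table, accepting only 'Yes' or 'No'."""
    answers = {}
    for line in table_text.strip().splitlines():
        if "|" not in line:
            continue
        item, answer = (cell.strip() for cell in line.split("|", 1))
        if answer not in ("Yes", "No"):
            raise ValueError(f"Non-definitive answer for {item!r}: {answer!r}")
        answers[item] = answer
    return answers

def total_score(answers: dict) -> int:
    """Sum placeholder weights for every item answered 'Yes'."""
    return sum(PLACEHOLDER_WEIGHTS.get(item, 0)
               for item, answer in answers.items() if answer == "Yes")

def recommended_action(score: int) -> str:
    """Map the total score to one of two actions (the cutoff is a placeholder
    standing in for the questionnaire's published thresholds)."""
    return ("Flag for clinical review of possible delirium"
            if score >= 4 else "Continue routine monitoring")

if __name__ == "__main__":
    model_reply = """item_1 | Yes
item_2 | No
item_3 | Yes
item_4 | No
item_5 | No
item_6 | No
item_7 | Yes"""
    answers = parse_table(model_reply)
    print(total_score(answers), recommended_action(total_score(answers)))
```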

Conclusions: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite these encouraging results, establishing broader generalizability and validating the approach in real-world settings will require additional research.
