Comparison of Prompt Engineering and Fine-Tuning Strategies in Large Language Models in the Classification of Clinical Notes.

AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science. Pub Date: 2024-05-31. eCollection Date: 2024-01-01.

Xiaodan Zhang, Nabasmita Talukdar, Sandeep Vemulapalli, Sumyeong Ahn, Jiankun Wang, Han Meng, Sardar Mehtab Bin Murtaza, Dmitry Leshchiner, Aakash Ajay Dave, Dimitri F Joseph, Martin Witteveen-Lane, Dave Chesla, Jiayu Zhou, Bin Chen

Volume 2024, pages 478-487. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11141826/pdf/

Citation count: 0

Abstract

Emerging large language models (LLMs) are being actively evaluated in various fields, including healthcare. Most studies have focused on established benchmarks and standard parameters; the variation and impact of prompt engineering and fine-tuning strategies have not been fully explored. This study benchmarks GPT-3.5 Turbo, GPT-4, and Llama-7B against BERT models and medical fellows' annotations in identifying patients with metastatic cancer from discharge summaries. Results revealed that clear, concise prompts incorporating reasoning steps significantly enhanced performance. GPT-4 exhibited superior performance among all models. Notably, one-shot learning and fine-tuning provided no incremental benefit. The model's accuracy was sustained even when keywords for metastatic cancer were removed or when half of the input tokens were randomly discarded. These findings underscore GPT-4's potential to substitute for specialized models, such as PubMedBERT, through strategic prompt engineering, and suggest opportunities to improve open-source models, which are better suited to use in clinical settings.
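The two input perturbations mentioned in the abstract (removing metastatic-cancer keywords and randomly discarding half of the input tokens) can be illustrated with a minimal sketch. This is not the authors' code: the keyword list, the whitespace tokenization, and the function names `remove_keywords` and `drop_tokens` are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
import re

# Hypothetical keyword list; the abstract does not say which terms were removed.
KEYWORDS = ["metastatic", "metastasis", "metastases"]


def remove_keywords(text, keywords=KEYWORDS):
    """Delete whole-word, case-insensitive keyword matches, then tidy spacing."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b",
        re.IGNORECASE,
    )
    stripped = pattern.sub("", text)
    # Collapse the runs of whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", stripped).strip()


def drop_tokens(text, keep_fraction=0.5, seed=0):
    """Keep a random subset of whitespace-separated tokens, preserving order.

    Whitespace splitting is an assumption; the study may have dropped
    model-tokenizer tokens instead.
    """
    tokens = text.split()
    n_keep = round(len(tokens) * keep_fraction)
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return " ".join(tokens[i] for i in kept)
```

A perturbed note produced this way would be passed to the same prompt as the original note, so that any change in accuracy can be attributed to the missing keywords or tokens rather than to a change in the prompt.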


