Model tuning or prompt tuning? A study of large language models for clinical concept and relation extraction

IF 4; Zone 2 (Medicine); Q2, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics. Pub Date: 2024-03-26. DOI: 10.1016/j.jbi.2024.104630

Cheng Peng, Xi Yang, Kaleb E Smith, Zehao Yu, Aokun Chen, Jiang Bian, Yonghui Wu

Abstract

Objective

To develop a soft prompt-based learning architecture for large language models (LLMs), examine prompt-tuning using frozen/unfrozen LLMs, and assess their abilities in transfer learning and few-shot learning.

Methods

We developed a soft prompt-based learning architecture and compared 4 strategies including (1) fine-tuning without prompts; (2) hard-prompting with unfrozen LLMs; (3) soft-prompting with unfrozen LLMs; and (4) soft-prompting with frozen LLMs. We evaluated GatorTron, a clinical LLM with up to 8.9 billion parameters, and compared GatorTron with 4 existing transformer models for clinical concept and relation extraction on 2 benchmark datasets for adverse drug events and social determinants of health (SDoH). We evaluated the few-shot learning ability and generalizability for cross-institution applications.
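Strategy (4), soft prompting with a frozen model, can be illustrated with a toy sketch. This is not GatorTron or the authors' architecture: the "model" here is a hypothetical mean-pooling linear scorer with fixed weights, and every name and value (`frozen_weights`, `soft_prompt`, `DIM`, `PROMPT_LEN`, the learning rate) is an assumption made for illustration. The only point it shows is that gradient updates touch the prepended prompt vectors while the model weights stay unchanged.

```python
DIM = 4          # embedding width (hypothetical)
PROMPT_LEN = 2   # number of trainable soft prompt vectors (hypothetical)

# Frozen "model": mean-pool the input vectors, then apply a fixed linear scorer.
# These weights are never updated during training.
frozen_weights = [0.5, -0.25, 0.75, 0.1]


def frozen_model(vectors):
    """Score a sequence of embedding vectors with the fixed weights."""
    pooled = [sum(v[d] for v in vectors) / len(vectors) for d in range(DIM)]
    return sum(w * x for w, x in zip(frozen_weights, pooled))


# Trainable soft prompt: continuous vectors prepended to the token embeddings.
soft_prompt = [[0.0] * DIM for _ in range(PROMPT_LEN)]


def predict(token_embeddings):
    """Run the frozen model on [soft prompt ; token embeddings]."""
    return frozen_model(soft_prompt + token_embeddings)


def train_step(token_embeddings, target, lr=0.5):
    """One gradient step on the soft prompt only; returns the pre-update loss."""
    seq_len = PROMPT_LEN + len(token_embeddings)
    error = predict(token_embeddings) - target
    # For loss = error**2, d(loss)/d(prompt[i][d]) = 2 * error * w[d] / seq_len.
    for vec in soft_prompt:
        for d in range(DIM):
            vec[d] -= lr * 2 * error * frozen_weights[d] / seq_len
    return error ** 2


# Made-up token embeddings standing in for one encoded input.
tokens = [
    [0.2, 0.1, -0.3, 0.4],
    [0.0, 0.5, 0.1, -0.2],
    [-0.1, 0.3, 0.2, 0.1],
]
weights_before = list(frozen_weights)
losses = [train_step(tokens, target=1.0) for _ in range(50)]
print(f"first loss={losses[0]:.4f} last loss={losses[-1]:.6f}")
print("model weights unchanged:", frozen_weights == weights_before)
```

In a real setting the frozen component would be a pretrained transformer and the gradients would come from automatic differentiation; the hand-derived gradient above only works because the toy scorer is linear.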

Results and Conclusion

When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6 to 3.1% and 1.2 to 2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2 to 2% and 0.6 to 11.7%, respectively. When LLMs are frozen, small LLMs fall well short of unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen models. Soft prompting with a frozen GatorTron-8.9B model achieved the best performance in the cross-institution evaluation. We demonstrate that (1) machines can learn soft prompts that outperform hard prompts composed by humans, (2) frozen LLMs have good few-shot learning ability and generalizability for cross-institution applications, (3) frozen LLMs reduce computing cost to 2.5 to 6% of that of previous methods using unfrozen LLMs, and (4) frozen LLMs require large models (e.g., several billion parameters or more) for good performance.
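For readers unfamiliar with the metric, strict F1 for concept extraction is commonly computed by counting a predicted concept as correct only when its boundaries and its type both match a gold annotation exactly. The sketch below assumes that definition; the `(start, end, type)` tuple layout, the function name `strict_f1`, and the example spans are hypothetical and are not taken from the paper or its evaluation scripts.

```python
from typing import Iterable, Set, Tuple

# A concept is (start_offset, end_offset, concept_type); this layout is an assumption.
Span = Tuple[int, int, str]


def strict_f1(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) where only exact span-and-type matches count."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: the second prediction is off by one character, so it
# counts as both a false positive and a missed gold concept under strict matching.
gold = [(0, 8, "Drug"), (20, 26, "ADE"), (30, 35, "Dosage")]
pred = [(0, 8, "Drug"), (20, 27, "ADE"), (30, 35, "Dosage"), (40, 44, "Drug")]
p, r, f1 = strict_f1(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these made-up spans, two of four predictions match exactly, giving precision 0.5, recall 2/3, and F1 4/7.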

Source journal

Journal of Biomedical Informatics (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 8.90

Self-citation rate: 6.70%

Articles published: 243

Review time: 32 days

Journal description: The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.