Structuring medication signatures as a language regression task: comparison of zero- and few-shot GPT with fine-tuned models

Augusto Garcia-Agundez, Julia L Kay, Jing Li, Milena Gianfrancesco, Baljeet Rai, Angela Hu, Gabriela Schmajuk, Jinoos Yazdany

JAMIA Open, 7(2): ooae051. Published 2024-06-18. DOI: 10.1093/jamiaopen/ooae051
Abstract
Importance: Electronic health record textual sources such as medication signatures (sigs) contain valuable information that is not always available in structured form. This information is commonly extracted through manual annotation, a repetitive and time-consuming task that could be fully automated using large language models (LLMs). While most sigs contain simple instructions, some include complex patterns.
Objectives: We aimed to compare the performance of GPT-3.5 and GPT-4 with smaller fine-tuned models (ClinicalBERT, BlueBERT) in extracting the average daily dose of 2 immunomodulating medications with frequent complex sigs: hydroxychloroquine and prednisone.
Methods: Using manually annotated sigs as the gold standard, we compared the performance of these models in 702 hydroxychloroquine and 22 104 prednisone prescriptions.
Results: GPT-4 vastly outperformed all other models for this task at any level of in-context learning. With 100 in-context examples, the model correctly annotates 94% of hydroxychloroquine and 95% of prednisone sigs to within 1 significant digit. Error analysis conducted by 2 additional manual annotators on annotator-model disagreements suggests that the vast majority of disagreements are model errors. Many model errors relate to ambiguous sigs on which there was also frequent annotator disagreement.
Discussion: Paired with minimal manual annotation, GPT-4 achieved excellent performance for language regression of complex medication sigs and vastly outperformed GPT-3.5, ClinicalBERT, and BlueBERT. However, the number of in-context examples needed to reach maximum performance was similar to GPT-3.5.
Conclusion: LLMs show great potential to rapidly extract structured data from sigs in a no-code fashion for clinical and research applications.
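The in-context learning setup described in the abstract can be illustrated with a minimal sketch: annotated sigs are formatted as question-answer pairs and prepended to the target sig before it is sent to the model. The example sigs, doses, and prompt wording below are illustrative assumptions, not the authors' actual prompts or data.

```python
# Hypothetical sketch of few-shot prompt construction for extracting the
# average daily dose (mg) from a medication sig. The examples and wording
# are invented for illustration; the paper's real prompts are not shown.

# (sig text, manually annotated average daily dose in mg) -- invented
ANNOTATED_EXAMPLES = [
    ("take 1 tablet (200 mg) by mouth twice daily", 400.0),
    ("take 1 tablet (200 mg) by mouth every other day", 100.0),
    ("take 2 tablets (5 mg each) by mouth every morning", 10.0),
]


def build_few_shot_prompt(sig: str, examples: list[tuple[str, float]]) -> str:
    """Assemble an instruction, the in-context examples, and the target sig."""
    lines = [
        "Extract the average daily dose in mg from the medication sig.",
        "Answer with a number only.",
        "",
    ]
    for example_sig, dose in examples:
        lines.append(f"Sig: {example_sig}")
        lines.append(f"Dose: {dose}")
        lines.append("")
    # The target sig ends the prompt; the model completes the dose.
    lines.append(f"Sig: {sig}")
    lines.append("Dose:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "take 1 tablet (200 mg) by mouth daily", ANNOTATED_EXAMPLES
)
```

The resulting string would then be sent to a chat-completion endpoint; the model's numeric completion is compared against the gold-standard annotation, as in the study's evaluation.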