Engineering of Generative Artificial Intelligence and Natural Language Processing Models to Accurately Identify Arrhythmia Recurrence.

Impact factor 9.1; Medicine, Tier 1; JCR Q1, Cardiac & Cardiovascular Systems

Circulation: Arrhythmia and Electrophysiology. Pub Date: 2025-01-01. Epub Date: 2024-12-16. DOI: 10.1161/CIRCEP.124.013023

Ruibin Feng, Kelly A Brennan, Zahra Azizi, Jatin Goyal, Brototo Deb, Hui Ju Chang, Prasanth Ganesan, Paul Clopton, Maxime Pedron, Samuel Ruipérez-Campillo, Yaanik B Desai, Hugo De Larochellière, Tina Baykaner, Marco V Perez, Miguel Rodrigo, Albert J Rogers, Sanjiv M Narayan

Pages: e013023. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771986/pdf/

Citations: 0

Abstract

Background: Large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT) excel at interpreting unstructured data from public sources, yet are limited when responding to queries on private repositories, such as electronic health records (EHRs). We hypothesized that prompt engineering could enhance the accuracy of LLMs for interpreting EHR data without requiring domain knowledge, thus expanding their utility for patients and personalized diagnostics.

Methods: We designed and systematically tested prompt engineering techniques to improve the ability of LLMs to interpret EHRs for nuanced diagnostic questions, referenced to a panel of medical experts. In 490 full-text EHR notes from 125 patients with prior life-threatening heart rhythm disorders, we asked GPT-4-turbo to identify recurrent arrhythmias distinct from prior events and tested 220,563 queries. To provide context, results were compared with rule-based natural language processing and Bidirectional Encoder Representations from Transformers (BERT)-based language models. Experiments were repeated for 2 additional LLMs.
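As a rough illustration of how the three prompt elements reported in the Results (a requested rationale, a structured data output, and in-context exemplars) might be combined, here is a minimal Python sketch. It is not the authors' prompt: the wording, the exemplar notes, the JSON field names `rationale` and `recurrence`, and the helper names `build_prompt` and `parse_reply` are all assumptions made for illustration, and no model API is called.

```python
import json

# Hypothetical exemplars written for this sketch; not taken from the study.
EXEMPLARS = [
    {
        "note": "Device check shows ventricular tachycardia on 3 May, treated "
                "with one shock. History of cardiac arrest in 2019.",
        "rationale": "The 3 May episode is new and separate from the 2019 event.",
        "recurrence": True,
    },
    {
        "note": "Routine follow-up. History of cardiac arrest in 2019. "
                "No events on device interrogation since implant.",
        "rationale": "Only the prior 2019 event is mentioned; nothing new.",
        "recurrence": False,
    },
]


def build_prompt(note_text, exemplars=EXEMPLARS):
    """Assemble one prompt: task, rationale request, output format, exemplars."""
    lines = [
        "You will read a clinical record from a patient with a prior arrhythmia.",
        "Decide whether it documents a new arrhythmia distinct from prior events.",
        "Explain your reasoning first, then give the answer.",
        'Reply with JSON only: {"rationale": "<text>", "recurrence": true or false}',
        "",
    ]
    for ex in exemplars:
        answer = {"rationale": ex["rationale"], "recurrence": ex["recurrence"]}
        lines.append("Note: " + ex["note"])
        lines.append("Answer: " + json.dumps(answer))
        lines.append("")
    lines.append("Note: " + note_text)
    lines.append("Answer:")
    return "\n".join(lines)


def parse_reply(reply):
    """Return True/False from a structured reply, or None if it is malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    value = data.get("recurrence")
    return value if isinstance(value, bool) else None
```

Requesting a fixed JSON shape lets malformed replies be detected and counted instead of being silently misread, which matters when many thousands of queries are scored automatically.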

Results: In an independent hold-out set of 389 notes, GPT-4-turbo had a baseline (out-of-the-box) balanced accuracy of 64.3%±4.7%. Prompting GPT-4-turbo for a rationale and a structured data output, and supplying in-context exemplars, raised balanced accuracy to 91.4%±3.8% (P<0.05). This surpassed the traditional logic-based natural language processing and BERT-based models (P<0.05). Results were consistent for GPT-3.5-turbo and Jurassic-2 LLMs.
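The abstract reports balanced accuracy without defining it. Assuming the standard binary definition (the mean of sensitivity and specificity), the metric can be sketched in Python as below. The function name and the toy labels are illustrative only and are not study data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (True = recurrence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of true recurrences detected
    specificity = tn / (tn + fp)  # fraction of non-recurrences correctly cleared
    return (sensitivity + specificity) / 2


# Toy example: 2 recurrences among 10 notes, and a model that always says "no".
y_true = [True] * 2 + [False] * 8
y_pred = [False] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In the toy case plain accuracy looks respectable at 0.8 even though no recurrence is detected, while balanced accuracy falls to chance level. This is why the metric suits end points that are rare relative to the number of notes.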

Conclusions: Prompt engineering strategies enabled LLMs to identify clinical end points from EHRs with an accuracy that surpassed natural language processing and approached that of experts, without the need for expert knowledge. These approaches could be applied to LLM queries in other domains, to facilitate automated, high-accuracy analysis of nuanced data sets by nonexperts.


Source journal

Circulation: Arrhythmia and Electrophysiology (Medicine, Cardiovascular Systems)

CiteScore: 13.70
Self-citation rate: 4.80%
Articles published: 187
Review time: 4-8 weeks

Journal description: Circulation: Arrhythmia and Electrophysiology is a journal dedicated to the study and application of clinical cardiac electrophysiology. It covers a wide range of topics including the diagnosis and treatment of cardiac arrhythmias, as well as research in this field. The journal accepts various types of studies, including observational research, clinical trials, epidemiological studies, and advancements in translational research.