Leveraging large language models for patient-ventilator asynchrony detection

Francesc Suñol, Candelaria de Haro, Verónica Santos-Pulpón, Sol Fernández-Gonzalo, Lluís Blanch, Josefina López-Aguilar, Leonardo Sarlabous

BMJ Health & Care Informatics, 32(1), 2025-06-27. DOI: 10.1136/bmjhci-2024-101426. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12207101/pdf/
Citations: 0
Abstract
Objectives: To evaluate whether large language models (LLMs) can achieve performance comparable to expert-developed deep neural networks in detecting flow starvation (FS) asynchronies during mechanical ventilation.
Methods: Four popular LLMs (GPT-4, Claude-3.5, Gemini-1.5 and DeepSeek-R1) were tested on a dataset of 6500 airway pressure cycles from 28 patients, classifying each breath into one of three FS categories. The same models were also tasked with generating executable code for one-dimensional convolutional neural networks (CNN-1D) and long short-term memory (LSTM) networks. Model performance was assessed using repeated holdout validation and compared with expert-developed models.
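To make the modelling setup in the Methods concrete, the sketch below builds a small CNN-1D for three-class cycle classification and scores it with repeated holdout validation. Everything specific here is an assumption rather than a detail from the paper: the TensorFlow/Keras and scikit-learn stack, the 256-sample cycle length, the layer sizes and the training hyperparameters are all hypothetical placeholders.

```python
# Minimal sketch of the pipeline described in the Methods.
# All specifics (Keras stack, 256-sample cycles, layer sizes, epochs)
# are assumptions, not taken from the paper.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 3   # three FS categories, per the abstract
CYCLE_LEN = 256   # assumed fixed length after resampling each pressure cycle

def build_cnn_1d() -> keras.Model:
    """Small CNN-1D: stacked conv/pool blocks, then a softmax head."""
    model = keras.Sequential([
        layers.Input(shape=(CYCLE_LEN, 1)),
        layers.Conv1D(16, 7, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(32, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def repeated_holdout(X, y, n_repeats=10, test_size=0.2):
    """Repeated holdout: re-split, retrain and score n_repeats times.

    Note: with only 28 patients, a real study would more likely split
    by patient to avoid leakage; a plain row-wise split is used here
    only to keep the sketch short.
    """
    accuracies = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        model = build_cnn_1d()
        model.fit(X_tr, y_tr, epochs=20, batch_size=64, verbose=0)
        _, acc = model.evaluate(X_te, y_te, verbose=0)
        accuracies.append(acc)
    return np.array(accuracies)
```

An LSTM variant of the same sketch would keep the input shape and softmax head but replace the convolutional blocks with a recurrent layer such as layers.LSTM(64).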
Results: LLMs performed poorly in direct FS classification (accuracy: GPT-4, 0.497; Claude-3.5, 0.627; Gemini-1.5, 0.544; DeepSeek-R1, 0.520). However, the Claude-3.5-generated CNN-1D code achieved the highest accuracy, 0.902 (0.899-0.906), outperforming the expert-developed models.
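The interval attached to the headline accuracy plausibly summarizes the spread across the repeated-holdout runs; that reading is an assumption, since the abstract does not say how the interval was computed. Under that assumption, the summary step could look like:

```python
# Hypothetical summary of per-run holdout accuracies into a point
# estimate and an interval (assumed reading of "0.902 (0.899-0.906)").
import numpy as np

def summarize(accuracies: np.ndarray) -> tuple[float, float, float]:
    mean = float(accuracies.mean())
    lo, hi = np.percentile(accuracies, [2.5, 97.5])  # e.g. a 95% percentile interval
    return mean, float(lo), float(hi)
```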
Discussion: LLMs demonstrated limited capability in direct classification but excelled at generating effective neural network models with minimal human intervention. This suggests that LLMs could accelerate model development for clinical applications, particularly for detecting patient-ventilator asynchronies, although clinical implementation will require further validation and consideration of ethical factors.