A Pipeline for Automating Emergency Medicine Documentation Using LLMs with Retrieval-Augmented Text Generation.

IF 4.3 · JCR Q2 (Computer Science, Artificial Intelligence) · CAS Category 4 (Computer Science)
Applied Artificial Intelligence · Published: 2025-06-18 · eCollection date: 2025-01-01 · DOI: 10.1080/08839514.2025.2519169
Denis Moser, Matthias Bender, Murat Sariyar
Applied Artificial Intelligence, Vol. 39, No. 1, Article 2519169. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12315831/pdf/
Citations: 0

Abstract

Accurate and efficient documentation of patient information is vital in emergency healthcare settings. Traditional manual documentation methods are often time-consuming and prone to errors, potentially affecting patient outcomes. Large Language Models (LLMs) offer a promising solution to enhance medical communication systems; however, their clinical deployment, particularly in non-English languages such as German, presents challenges related to content accuracy, clinical relevance, and data privacy. This study addresses these challenges by developing and evaluating an automated pipeline for emergency medical documentation in German. The research objectives include (1) generating synthetic dialogues with known ground truth data to create controlled datasets for evaluating NLP performance and (2) designing an innovative pipeline to retrieve essential clinical information from these dialogues. A subset of 100 anonymized patient records from the MIMIC-IV-ED dataset was selected, ensuring diversity in demographics, chief complaints, and conditions. A Retrieval-Augmented Generation (RAG) system extracted key nominal and numerical features using chunking, embedding, and dynamic prompts. Evaluation metrics included precision, recall, F1-score, and sentiment analysis. Initial results demonstrated high extraction accuracy, particularly in medication data (F1-scores: 86.21%-100%), though performance declined in nuanced clinical language, requiring further refinement for real-world emergency settings.
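The extraction pipeline the abstract describes, chunking the dialogue, embedding and retrieving relevant passages, splicing them into a dynamic prompt, and scoring extractions with F1, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words embedding stands in for a real sentence-embedding model, and the sample dialogue and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=10):
    """Split a dialogue transcript into overlapping word-window chunks."""
    words = text.split()
    step = size // 2
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - step, 1), step)]


def embed(text):
    """Toy bag-of-words embedding; a real pipeline would use a neural embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, chunks, k=2):
    """Rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Dynamic prompt: retrieved context is spliced in before the extraction instruction."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nExtract the value for: {query}\nAnswer:"


def f1(predicted, gold):
    """F1-score over extracted items, as used for nominal-feature evaluation."""
    tp = len(set(predicted) & set(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Illustrative dialogue snippet (invented, not from MIMIC-IV-ED).
dialogue = ("Patient reports chest pain since this morning. "
            "Blood pressure measured at 150 over 95. "
            "Current medication: aspirin 100 mg daily. "
            "No known allergies reported by the patient.")
chunks = chunk(dialogue, size=10)
top = retrieve("current medication", chunks)
prompt = build_prompt("current medication", top)
```

In the full pipeline the assembled prompt would be sent to the LLM, and its answer compared against the ground-truth fields of the synthetic dialogue via `f1`; the retrieval step keeps the prompt focused on the few chunks likely to contain the requested feature.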


Source journal: Applied Artificial Intelligence (Engineering: Electronic & Electrical)
CiteScore: 5.20 · Self-citation rate: 3.60% · Articles per year: 106 · Review time: 6 months
Journal scope: Applied Artificial Intelligence addresses concerns in applied research and applications of artificial intelligence (AI). The journal also acts as a medium for exchanging ideas about the impacts of AI research. Articles highlight advances in the use of AI systems for solving tasks in management, industry, engineering, administration, and education; evaluations of existing AI systems and tools, emphasizing comparative studies and user experiences; and the economic, social, and cultural impacts of AI. Papers on key applications, highlighting methods, time schedules, person-months needed, and other relevant material are welcome.