Deep sight: enhancing periprocedural adverse event recording in endoscopy by structuring text documentation with privacy-preserving large language models

iGIE. Published: 2024-12-01. DOI: 10.1016/j.igie.2024.08.001

Isabella C. Wiest, MD, MSc; Dyke Ferber, MD; Stefan Wittlinger, MSc; Matthias P. Ebert, MD; Sebastian Belle, MD; Jakob Nikolas Kather, MD, MSc

Journal article, iGIE, Volume 3, Issue 4, Pages 447-452.e5. Full text: https://www.sciencedirect.com/science/article/pii/S2949708624001067


Abstract

Background and Aims

Assessing adverse events from endoscopic procedures is essential for successful interventions: it underpins accurate follow-up, adverse event management, and quality control. Despite the critical need for structured documentation, current practice often relies on free-text records, which makes scalable analysis of interventions difficult. Large language models (LLMs) offer a promising solution because they can automatically extract adverse event details from procedural reports without altering existing documentation practices.

Methods

We analyzed 672 endoscopy reports, using OpenAI’s GPT-4 and Llama-2–based models to structure the data in JavaScript Object Notation (JSON) for efficient analysis. We used an automated LLM pipeline to extract adverse events such as bleeding, perforation, and aspiration. Prompt engineering to improve model performance was explored on a proof-of-concept set (PoC-S) of n = 171 reports. The final analysis was run on an additional external test set of 501 reports.
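The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the prompt wording, the three-key boolean schema, and the names `build_prompt`, `parse_response`, `extract_events`, and `stub_llm` are all hypothetical, and the model is represented by a plain callable so that no specific LLM API is implied.

```python
import json
from typing import Callable, Dict

# Hypothetical schema: the three adverse events named in the abstract.
EVENTS = ("bleeding", "perforation", "aspiration")

# Hypothetical prompt wording; the study's actual prompts are not given here.
PROMPT_TEMPLATE = (
    "Read the endoscopy report below and answer only with a JSON object "
    "containing the boolean keys {keys}.\n\nReport:\n{report}"
)


def build_prompt(report: str) -> str:
    """Fill the prompt template with the event keys and the report text."""
    return PROMPT_TEMPLATE.format(keys=", ".join(EVENTS), report=report)


def parse_response(raw: str) -> Dict[str, bool]:
    """Parse the outermost {...} span of the model output and validate it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    for key in EVENTS:
        if not isinstance(data.get(key), bool):
            raise ValueError(f"key {key!r} is missing or not a boolean")
    return {key: data[key] for key in EVENTS}


def extract_events(report: str, llm: Callable[[str], str]) -> Dict[str, bool]:
    """Run one report through the model and return the structured result."""
    return parse_response(llm(build_prompt(report)))


def stub_llm(prompt: str) -> str:
    """Keyword-matching stand-in for a real model, for demonstration only."""
    text = prompt.split("Report:\n", 1)[1].lower()
    return "Answer: " + json.dumps({key: key in text for key in EVENTS})


if __name__ == "__main__":
    report = "Polypectomy performed. Minor bleeding at the resection site, clipped."
    print(extract_events(report, stub_llm))
```

Passing the model in as a callable keeps the parsing and validation logic identical whether the backend is a hosted model or a locally run one, which is the distinction the study turns on.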

Results

GPT-4 showed high accuracy, with a sensitivity of 97% and specificity of 92% in the PoC-S and 91% and 96%, respectively, in the test set (TS). Its use in real-world settings, however, is limited by privacy concerns. Llama-2–based models, especially the variants fine-tuned for German, demonstrated comparable performance (PoC-S: sensitivity 94%, specificity 92%; TS: sensitivity 89%, specificity 93%) and offered a viable privacy-compliant alternative. Model effectiveness was further influenced by prompt engineering: experiments showed that specificity and sensitivity could vary substantially depending on which prompt features were included, underscoring the importance of tailored prompt design.
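For readers less familiar with the two metrics reported above, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP), computed against the reference labels. The sketch below shows that calculation; the function name `sensitivity_specificity` is hypothetical and the labels are toy values made up for illustration, not data from the study.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(
    y_true: Iterable[bool], y_pred: Iterable[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for paired reference/predicted labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1  # event present and detected
        elif truth:
            fn += 1  # event present but missed
        elif pred:
            fp += 1  # event absent but flagged
        else:
            tn += 1  # event absent and not flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference label")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy labels: 4 reports with an adverse event, 6 without.
    truth = [True, True, True, True, False, False, False, False, False, False]
    preds = [True, True, True, False, False, False, False, False, False, True]
    sens, spec = sensitivity_specificity(truth, preds)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy case one of four true events is missed and one of six event-free reports is flagged, so sensitivity is 3/4 and specificity is 5/6.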

Conclusions

Applying LLMs to extract structured medical information, particularly from endoscopy reports, offers an efficient, scalable, and adaptable documentation method that captures adverse events accurately with a low error rate. It facilitates immediate quality reporting and reduces manual documentation efforts.
