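Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center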

IF 3.7; Zone 2 (Medicine); JCR Q2: COMPUTER SCIENCE, INFORMATION SYSTEMS

International Journal of Medical Informatics Pub Date : 2024-09-19 DOI:10.1016/j.ijmedinf.2024.105626

Davide Badalotti, Akanksha Agrawal, Umberto Pensato, Giovanni Angelotti, Simona Marcheselli

Volume 192, Article 105626

Citations: 0

Abstract


Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center

Introduction

Data collection often relies on time-consuming manual inputs, with a vast amount of information embedded in unstructured texts such as patients’ medical records and clinical notes. Our study aims to develop a pipeline that combines active learning (AL) and NLP techniques to enhance data extraction in an acute ischemic stroke cohort.

Materials and methods

Consecutive acute ischemic stroke patients who received reperfusion therapies at IRCCS Humanitas Research Hospital were included. The Italian NLP Bidirectional Encoder Representations from Transformers (BERT) model was trained with AL to automatically extract clinical variables from electronic health text. Simulated active learning performance was evaluated on a set of labels representing patients’ comorbidities, comparing Bayesian Active Learning by Disagreement (BALD) and random text selection. Prognostic models predicting patients’ functional outcomes using Gradient Boosting were trained on manually labelled and semi-automatically extracted data and their performance was compared.
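
The comparison between BALD and random text selection can be illustrated with a minimal sketch. It assumes a binary comorbidity label and several stochastic forward passes per text (for example Monte Carlo dropout, which the abstract does not specify). The function names, note ids, and probabilities below are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(mc_probs: list[float]) -> float:
    """BALD score: entropy of the mean prediction minus the mean entropy.

    mc_probs holds the positive-class probability from each stochastic
    forward pass for a single text (assumed setup, not from the paper).
    """
    mean_p = sum(mc_probs) / len(mc_probs)
    expected_entropy = sum(binary_entropy(p) for p in mc_probs) / len(mc_probs)
    return binary_entropy(mean_p) - expected_entropy


def select_bald(pool: dict[str, list[float]], k: int) -> list[str]:
    """Return the k text ids with the highest BALD score."""
    ranked = sorted(pool, key=lambda text_id: bald_score(pool[text_id]), reverse=True)
    return ranked[:k]


def select_random(pool: dict[str, list[float]], k: int, seed: int = 0) -> list[str]:
    """Baseline: pick k text ids uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(pool), k)


# Hypothetical unlabelled pool: four stochastic passes per clinical note.
pool = {
    "note_a": [0.05, 0.95, 0.10, 0.90],  # passes disagree strongly
    "note_b": [0.50, 0.50, 0.50, 0.50],  # uncertain, but passes agree
    "note_c": [0.99, 0.98, 0.99, 0.97],  # confident and consistent
}

print(select_bald(pool, k=1))  # ['note_a']
```

BALD ranks note_a first because the passes disagree about it, whereas note_b is uncertain in every pass and therefore carries little information about the model parameters. Plain uncertainty sampling would rank note_b at least as high, which is the practical difference between the two criteria.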

Results

The active learning process initially showed null performance until around 20% of texts were labelled, possibly due to root layers freezing in the BERT model, yet overall, active learning improved model learning efficiency across most comorbidities. Prognostic modelling showed no significant difference in performance between models trained on manually labelled versus semi-automatically extracted data, indicating effective prediction capabilities in both settings.
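
One way to check for "no significant difference" between two prognostic models is a paired bootstrap on a shared test set. The abstract does not say which statistical test was used, so the sketch below is an assumption rather than the study's method. The labels and predictions are synthetic toy values and all names are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_accuracy_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Percentile 95% interval for accuracy(pred_a) - accuracy(pred_b).

    Patients are resampled with replacement, and both models are scored
    on the same resample so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(y_true[i] == pred_a[i] for i in idx) / n
        acc_b = sum(y_true[i] == pred_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper


# Synthetic toy data: 20 patients with a binary functional outcome.
y_true = [1, 0] * 10

# Hypothetical predictions from a model trained on manually labelled data.
pred_manual = list(y_true)
for i in (3, 7):
    pred_manual[i] = 1 - pred_manual[i]

# Hypothetical predictions from a model trained on semi-automatically extracted data.
pred_semi = list(y_true)
for i in (5, 11):
    pred_semi[i] = 1 - pred_semi[i]

print(accuracy(y_true, pred_manual), accuracy(y_true, pred_semi))
print(bootstrap_accuracy_diff(y_true, pred_manual, pred_semi))
```

If the interval contains zero, the data give no evidence that one training source yields a better model than the other. The same resampling scheme works with other metrics, such as AUC, by replacing the accuracy computation.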

Conclusions

We developed an efficient language model to automate the extraction of clinical data from Italian unstructured health texts in a cohort of ischemic stroke patients. In a preliminary analysis, we demonstrated its potential applicability for enhancing prediction model accuracy.


Source journal

International Journal of Medical Informatics (Medicine; Computer Science: Information Systems)

CiteScore

8.90

Self-citation rate

4.10%

Articles published

217

Review time

42 days

About the journal: International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings. The scope of the journal covers: Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.; Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.; Educational computer-based programs pertaining to medical informatics or medicine in general; Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.