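Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center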

IF 3.7, Zone 2 (Medicine), JCR Q2 (Computer Science, Information Systems)
Davide Badalotti, Akanksha Agrawal, Umberto Pensato, Giovanni Angelotti, Simona Marcheselli
International Journal of Medical Informatics, vol. 192, Article 105626. Published 2024-09-19. DOI: 10.1016/j.ijmedinf.2024.105626
Full text: https://www.sciencedirect.com/science/article/pii/S1386505624002892

Introduction

Data collection often relies on time-consuming manual input, while a vast amount of information remains embedded in unstructured texts such as patients’ medical records and clinical notes. Our study aims to develop a pipeline that combines active learning (AL) and NLP techniques to enhance data extraction in an acute ischemic stroke cohort.

Materials and methods

Consecutive acute ischemic stroke patients who received reperfusion therapies at IRCCS Humanitas Research Hospital were included. An Italian Bidirectional Encoder Representations from Transformers (BERT) model was trained with AL to automatically extract clinical variables from electronic health record text. Simulated active learning performance was evaluated on a set of labels representing patients’ comorbidities, comparing Bayesian Active Learning by Disagreement (BALD) with random text selection. Prognostic models predicting patients’ functional outcomes using Gradient Boosting were trained on manually labelled data and on semi-automatically extracted data, and their performance was compared.
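The contrast between BALD and random selection can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each comorbidity is a binary label, that the model yields several stochastic probability estimates per text (for example from repeated forward passes with dropout left on), and that per-label scores are summed. The function names `bald_scores`, `select_bald`, and `select_random` are hypothetical.

```python
import math
import random


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_scores(probs):
    """BALD score per text.

    probs[t][i][l] is the predicted probability of label l for text i on
    stochastic pass t (assumed input layout). The score is the entropy of
    the mean prediction minus the mean entropy of the individual
    predictions, summed over labels (summing is an assumption).
    """
    n_passes = len(probs)
    n_texts = len(probs[0])
    n_labels = len(probs[0][0])
    scores = []
    for i in range(n_texts):
        total = 0.0
        for l in range(n_labels):
            samples = [probs[t][i][l] for t in range(n_passes)]
            predictive = binary_entropy(sum(samples) / n_passes)
            expected = sum(binary_entropy(p) for p in samples) / n_passes
            total += predictive - expected
        scores.append(total)
    return scores


def select_bald(probs, k):
    """Indices of the k texts on which the stochastic passes disagree most."""
    scores = bald_scores(probs)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def select_random(n_pool, k, seed=0):
    """Baseline: k texts drawn uniformly at random without replacement."""
    return random.Random(seed).sample(range(n_pool), k)
```

A text on which every pass predicts 0.5 scores zero, because the passes agree even though each one is uncertain. A text on which the passes swing between 0.01 and 0.99 scores high, so BALD would send it to the annotator first.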

Results

The active learning process initially showed null performance until around 20% of texts were labelled, possibly due to root layers freezing in the BERT model. Overall, however, active learning improved model learning efficiency across most comorbidities. Prognostic modelling showed no significant difference in performance between models trained on manually labelled versus semi-automatically extracted data, indicating effective prediction capabilities in both settings.
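The abstract does not state how the two prognostic models were compared. One common option, shown here purely as an assumption, is a paired bootstrap on the difference in AUC, with the functional outcome treated as binary. The names `auc` and `bootstrap_auc_difference` are hypothetical, and `scores_a` and `scores_b` stand for the predicted risks of the models trained on manually labelled and on semi-automatically extracted data.

```python
import random


def auc(y_true, y_score):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(y_true, scores_a, scores_b, n_boot=1000, seed=0):
    """Approximate 95% percentile interval for AUC(a) - AUC(b).

    Patients are resampled with replacement, and both models are scored on
    the same resample, which keeps the comparison paired.
    """
    if len(set(y_true)) < 2:
        raise ValueError("y_true must contain both classes")
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lost one class; draw again
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper
```

Under this sketch, an interval that contains zero would be read as no detectable difference between the two training settings, which is the kind of result the abstract reports.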

Conclusions

We developed an efficient language model to automate the extraction of clinical data from Italian unstructured health texts in a cohort of ischemic stroke patients. In a preliminary analysis, we demonstrated its potential applicability for enhancing prediction model accuracy.
Source journal
International Journal of Medical Informatics (Medicine; Computer Science, Information Systems)
CiteScore: 8.90
Self-citation rate: 4.10%
Articles published: 217
Review time: 42 days
About the journal: International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings. The scope of the journal covers: information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.; computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.; educational computer-based programs pertaining to medical informatics or medicine in general; and organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.