DeIDNER Model: A Neural Network Named Entity Recognition Model for Use in the De-identification of Clinical Notes

Mahanazuddin Syed, K. Sexton, M. Greer, Shorabuddin Syed, Joseph VanScoy, Farhan Kawsar, Erica Olson, Karan B. Patel, Jake Erwin, S. Bhattacharyya, M. Zozus, F. Prior
DOI: 10.5220/0010884500003123
Published in: Biomedical Engineering Systems and Technologies, International Joint Conference, BIOSTEC ... Revised Selected Papers. BIOSTEC (Conference), pp. 640-647
Publication date: 2022-02-01
Citations: 1

Abstract

Clinical named entity recognition (NER) is an essential building block for many downstream natural language processing (NLP) applications such as information extraction and de-identification. Recently, deep learning (DL) methods that utilize word embeddings have become popular in clinical NLP tasks. However, there has been little work on evaluating and combining word embeddings trained on different domains. The goal of this study is to improve the performance of NER on clinical discharge summaries by developing a DL model that combines different embeddings, and to investigate combinations of standard and contextual embeddings from the general and clinical domains. We developed: 1) a human-annotated, high-quality internal corpus of discharge summaries, and 2) a NER model with an input embedding layer that combines different embeddings: standard word embeddings, context-based word embeddings, a character-level word embedding produced by a convolutional neural network (CNN), and external knowledge sources, along with word features encoded as one-hot vectors. The embedding layer was followed by bidirectional long short-term memory (Bi-LSTM) and conditional random field (CRF) layers. The proposed model matches or exceeds state-of-the-art performance on two publicly available data sets and achieves an F1 score of 94.31% on the internal corpus. After incorporating mixed-domain, clinically pre-trained contextual embeddings, the F1 score further improved to 95.36% on the internal corpus. This study demonstrated an efficient way of combining different embeddings, improving recognition performance and aiding the downstream de-identification of clinical notes.
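The input layer described above concatenates several per-token representations before the Bi-LSTM. The following is a minimal sketch of that concatenation step only, not the paper's implementation: all dimensions, the toy vocabulary, and the simplified character-level "CNN" (a random 1-D convolution with max-over-time pooling) are hypothetical stand-ins for the pretrained embeddings and trained filters the model actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
WORD_DIM, CHAR_DIM, CHAR_FILTERS, KERNEL = 8, 4, 6, 3

# Toy lookup tables standing in for pretrained word embeddings.
vocab = {"patient": 0, "discharged": 1, "from": 2, "hospital": 3}
word_emb = rng.normal(size=(len(vocab), WORD_DIM))
char_emb = rng.normal(size=(128, CHAR_DIM))           # indexed by ASCII code
conv_w = rng.normal(size=(CHAR_FILTERS, KERNEL * CHAR_DIM))

def char_cnn(token: str) -> np.ndarray:
    """Character-level embedding: a 1-D convolution over character
    embeddings followed by max-over-time pooling (a simplified,
    untrained stand-in for the paper's char-CNN)."""
    pad = KERNEL // 2
    codes = [0] * pad + [min(ord(c), 127) for c in token] + [0] * pad
    chars = char_emb[codes]                           # (len + 2*pad, CHAR_DIM)
    windows = np.stack([chars[i:i + KERNEL].ravel()
                        for i in range(len(codes) - KERNEL + 1)])
    feats = np.tanh(windows @ conv_w.T)               # (len, CHAR_FILTERS)
    return feats.max(axis=0)                          # max pooling over time

def shape_features(token: str) -> np.ndarray:
    """One-hot-style word features: initial capital, contains digit, all caps."""
    return np.array([token[0].isupper(),
                     any(c.isdigit() for c in token),
                     token.isupper()], dtype=float)

def token_vector(token: str) -> np.ndarray:
    """Concatenate word, character-level, and feature embeddings into the
    single per-token input vector fed to the Bi-LSTM-CRF layers."""
    w = word_emb[vocab.get(token.lower(), 0)]
    return np.concatenate([w, char_cnn(token), shape_features(token)])

vec = token_vector("UAMS")
print(vec.shape)  # (WORD_DIM + CHAR_FILTERS + 3,) = (17,)
```

In the actual model, each component would be a trained or pretrained embedding (standard, contextual, and char-CNN), and the concatenated vectors for a sentence would be passed through Bi-LSTM and CRF layers to produce the entity tag sequence.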