DeIDNER Model: A Neural Network Named Entity Recognition Model for Use in the De-identification of Clinical Notes

Mahanazuddin Syed, K. Sexton, M. Greer, Shorabuddin Syed, Joseph VanScoy, Farhan Kawsar, Erica Olson, Karan B. Patel, Jake Erwin, S. Bhattacharyya, M. Zozus, F. Prior
DOI: 10.5220/0010884500003123
Published in: Biomedical Engineering Systems and Technologies, International Joint Conference, BIOSTEC ... Revised Selected Papers. BIOSTEC (Conference), pp. 640-647
Publication date: 2022-02-01
Citations: 1

Abstract

Clinical named entity recognition (NER) is an essential building block for many downstream natural language processing (NLP) applications such as information extraction and de-identification. Recently, deep learning (DL) methods that utilize word embeddings have become popular in clinical NLP tasks. However, there has been little work on evaluating and combining word embeddings trained on different domains. The goal of this study is to improve the performance of NER on clinical discharge summaries by developing a DL model that combines different embeddings, and to investigate combinations of standard and contextual embeddings from the general and clinical domains. We developed: 1) a human-annotated, high-quality internal corpus of discharge summaries, and 2) a NER model with an input embedding layer that combines different embeddings: standard word embeddings, context-based word embeddings, a character-level word embedding produced by a convolutional neural network (CNN), and external knowledge sources, along with word features encoded as one-hot vectors. The embedding layer was followed by bidirectional long short-term memory (Bi-LSTM) and conditional random field (CRF) layers. The proposed model matches or exceeds state-of-the-art performance on two publicly available data sets and achieves an F1 score of 94.31% on the internal corpus. After incorporating mixed-domain, clinically pre-trained contextual embeddings, the F1 score further improved to 95.36% on the internal corpus. This study demonstrated an efficient way of combining different embeddings, improving recognition performance and aiding the downstream de-identification of clinical notes.
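The input layer described above concatenates several per-token representations before the Bi-LSTM. The following is a minimal sketch of that concatenation step only, not the paper's implementation: all dimensions, the toy vocabulary, and the simplified character-level "CNN" (a random 1-D convolution with max-over-time pooling) are hypothetical stand-ins for the pretrained embeddings and trained filters the model actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
WORD_DIM, CHAR_DIM, CHAR_FILTERS, KERNEL = 8, 4, 6, 3

# Toy lookup tables standing in for pretrained word embeddings.
vocab = {"patient": 0, "discharged": 1, "from": 2, "hospital": 3}
word_emb = rng.normal(size=(len(vocab), WORD_DIM))
char_emb = rng.normal(size=(128, CHAR_DIM))           # indexed by ASCII code
conv_w = rng.normal(size=(CHAR_FILTERS, KERNEL * CHAR_DIM))

def char_cnn(token: str) -> np.ndarray:
    """Character-level embedding: a 1-D convolution over character
    embeddings followed by max-over-time pooling (a simplified,
    untrained stand-in for the paper's char-CNN)."""
    pad = KERNEL // 2
    codes = [0] * pad + [min(ord(c), 127) for c in token] + [0] * pad
    chars = char_emb[codes]                           # (len + 2*pad, CHAR_DIM)
    windows = np.stack([chars[i:i + KERNEL].ravel()
                        for i in range(len(codes) - KERNEL + 1)])
    feats = np.tanh(windows @ conv_w.T)               # (len, CHAR_FILTERS)
    return feats.max(axis=0)                          # max pooling over time

def shape_features(token: str) -> np.ndarray:
    """One-hot-style word features: initial capital, contains digit, all caps."""
    return np.array([token[0].isupper(),
                     any(c.isdigit() for c in token),
                     token.isupper()], dtype=float)

def token_vector(token: str) -> np.ndarray:
    """Concatenate word, character-level, and feature embeddings into the
    single per-token input vector fed to the Bi-LSTM-CRF layers."""
    w = word_emb[vocab.get(token.lower(), 0)]
    return np.concatenate([w, char_cnn(token), shape_features(token)])

vec = token_vector("UAMS")
print(vec.shape)  # (WORD_DIM + CHAR_FILTERS + 3,) = (17,)
```

In the actual model, each component would be a trained or pretrained embedding (standard, contextual, and char-CNN), and the concatenated vectors for a sentence would be passed through Bi-LSTM and CRF layers to produce the entity tag sequence.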