Clinical de-identification using sub-document analysis and ELECTRA

Rosario Catelli, F. Gargiulo, Emanuele Damiano, M. Esposito, G. Pietro
Published in: 2021 IEEE International Conference on Digital Health (ICDH), pp. 266-275, 2021-09-01
DOI: 10.1109/icdh52753.2021.00050 (https://doi.org/10.1109/icdh52753.2021.00050)
Citations: 6

Abstract

Protecting privacy in the health context is becoming a crucial task given the exponential increase in the adoption of Electronic Health Records (EHRs) around the world. Such data can be used for medical investigation and research only after all so-called Protected Health Information (PHI) has been removed. This paper proposes a clinical de-identification system that applies deep learning techniques for Named Entity Recognition to recognize PHI entities in EHRs so that they can be replaced by surrogates for anonymization purposes. The system is based on ELECTRA, a recent neural language model, and is enhanced by a sub-document level analysis that groups input sentences together according to a Sentences Grouping Factor (SGF), broadening the representation context and consequently enhancing the system's ability to learn. The system was tested experimentally on the official dataset distributed in 2014 by the Informatics for Integrating Biology & the Bedside (i2b2) research group, where it outperformed the state of the art in detection at the category level, which is crucial for properly substituting PHI entities with surrogates. The contribution of each component was also confirmed by a further experimental analysis in which the BERT language model was substituted for ELECTRA and the SGF was varied within the limits imposed by the maximum input size of the language model used.
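
The sentence grouping described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' exact procedure, so everything here is an assumption made for illustration: the function name `group_sentences`, the `max_tokens` parameter, the rule of closing a group early when the limit would be exceeded, and counting pre-split tokens rather than the subword tokens a real ELECTRA or BERT tokenizer would produce.

```python
from typing import List


def group_sentences(
    sentences: List[List[str]], sgf: int, max_tokens: int = 512
) -> List[List[str]]:
    """Merge up to `sgf` consecutive tokenized sentences into one input sequence.

    Hypothetical helper: `sgf` plays the role of the Sentences Grouping Factor.
    A group is closed early if adding the next sentence would push it past
    `max_tokens` (an assumed stand-in for the model's maximum input size).
    A single sentence longer than `max_tokens` is kept as its own group and
    is not truncated here.
    """
    if sgf < 1:
        raise ValueError("sgf must be >= 1")

    groups: List[List[str]] = []
    current: List[str] = []
    count = 0  # number of sentences already merged into `current`

    for sent in sentences:
        group_is_full = count == sgf
        would_overflow = len(current) + len(sent) > max_tokens
        if current and (group_is_full or would_overflow):
            groups.append(current)
            current, count = [], 0
        current = current + list(sent)
        count += 1

    if current:
        groups.append(current)
    return groups


# Toy record split into three tokenized sentences (invented example text).
record = [
    ["Mr.", "Smith", "was", "admitted", "."],
    ["He", "lives", "in", "Boston", "."],
    ["Follow-up", "on", "05/12", "."],
]
grouped = group_sentences(record, sgf=2)
print([len(g) for g in grouped])  # [10, 4]
```

With `sgf=1` each sentence is passed to the model on its own. Larger values give the tagger more surrounding context per input, until the maximum input size forces groups to close early.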