The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text

Hanna Berg, Aron Henriksson, H. Dalianis
Published in: International Workshop on Health Text Mining and Information Analysis, 2020-11-01
DOI: 10.18653/v1/2020.louhi-1.1 (https://doi.org/10.18653/v1/2020.louhi-1.1)
Citation count: 13

Abstract

The impact of de-identification on data quality and, in particular, utility for developing models for downstream tasks has been more thoroughly studied for structured data than for unstructured text. While previous studies indicate that text de-identification has a limited impact on models for downstream tasks, it remains unclear what the impact is with various levels and forms of de-identification, in particular concerning the trade-off between precision and recall. In this paper, the impact of de-identification is studied on downstream named entity recognition in Swedish clinical text. The results indicate that de-identification models with moderate to high precision lead to similar downstream performance, while low precision has a substantial negative impact. Furthermore, different strategies for concealing sensitive information affect performance to different degrees, ranging from pseudonymisation having a low impact to the removal of entire sentences with sensitive information having a high impact. This study indicates that it is possible to increase the recall of models for identifying sensitive information without negatively affecting the use of de-identified text data for training models for clinical named entity recognition; however, there is ultimately a trade-off between the level of de-identification and the subsequent utility of the data.
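
To make the concealment strategies mentioned in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Span` type, the function names, the `NAME` label, the surrogate value, and the example sentences are all hypothetical. Pseudonymisation and sentence removal are the two strategies named in the abstract; masking with a class label is included only as an assumed intermediate strategy for comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Span:
    """A detected sensitive span (hypothetical structure, character offsets)."""
    start: int
    end: int
    label: str


def _replace(text: str, spans: List[Span], make: Callable[[Span], str]) -> str:
    """Rebuild text, substituting each span with the string returned by make()."""
    parts, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        parts.append(text[cursor:span.start])
        parts.append(make(span))
        cursor = span.end
    parts.append(text[cursor:])
    return "".join(parts)


def mask(text: str, spans: List[Span]) -> str:
    """Assumed intermediate strategy: replace each span with its class label."""
    return _replace(text, spans, lambda s: f"[{s.label}]")


def pseudonymise(text: str, spans: List[Span], surrogates: Dict[str, str]) -> str:
    """Replace each span with a surrogate value of the same class."""
    return _replace(text, spans, lambda s: surrogates[s.label])


def drop_sentences(sentences: List[str], spans_per_sentence: List[List[Span]]) -> List[str]:
    """Most aggressive strategy: remove every sentence containing a sensitive span."""
    return [sent for sent, spans in zip(sentences, spans_per_sentence) if not spans]


# Toy, invented example data (not from the paper's corpus).
sentences = [
    "Patient Anna Svensson was admitted with chest pain.",
    "ECG showed no abnormalities.",
]
name = "Anna Svensson"
start = sentences[0].index(name)
spans_per_sentence = [[Span(start, start + len(name), "NAME")], []]

masked = mask(sentences[0], spans_per_sentence[0])
pseudo = pseudonymise(sentences[0], spans_per_sentence[0], {"NAME": "Maria Lind"})
kept = drop_sentences(sentences, spans_per_sentence)

print(masked)
print(pseudo)
print(kept)
```

The sketch shows why the strategies could differ in downstream utility: pseudonymisation keeps the sentence fluent, whereas sentence removal also discards the non-sensitive context ("admitted with chest pain") that a clinical named entity recognition model could otherwise learn from. A low-precision de-identifier would produce spurious spans, so each strategy would alter or delete more non-sensitive text.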