Deidentification of free-text medical records using pre-trained bidirectional transformers.

Proceedings of the ACM Conference on Health, Inference, and Learning. Pub Date: 2020-04-01. Epub Date: 2020-04-02. DOI: 10.1145/3368555.3384455

Alistair E W Johnson, Lucas Bulgarelli, Tom J Pollard

Volume 2020, pages 214-221. PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/e7/a6/nihms-1679580.PMC8330601.pdf


Abstract

The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. However, deidentification, the process of removing identifiers, is challenging. High-quality annotated data for developing models is scarce; many target identifiers are highly heterogeneous (for example, there are countless variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. As a result, patient data is often withheld when sharing would be beneficial, and identifiable patient data is often divulged when a deidentified version would suffice. In recent years, advances in machine learning methods have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human-interpretable evaluation measures and demonstrate state-of-the-art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, lack of portability of models, and paucity of training data. Code to develop our model is open source, allowing for broad reuse.
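To make the sensitivity requirement mentioned in the abstract concrete, the following is a minimal sketch of token-level sensitivity (recall) for a deidentification system. It is an illustration only and is not the evaluation measure proposed in the paper: the function name `phi_sensitivity`, the boolean per-token labeling scheme, and the example sentence are all assumptions made here.

```python
from typing import Sequence


def phi_sensitivity(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Fraction of gold identifier tokens that the system also flagged.

    Hypothetical helper: gold[i] / pred[i] are True when token i is
    (annotated as / predicted to be) part of an identifier.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g and p)
    false_neg = sum(1 for g, p in zip(gold, pred) if g and not p)
    total = true_pos + false_neg
    # With no identifiers in the gold labels, nothing can be missed.
    return 1.0 if total == 0 else true_pos / total


# Made-up example: the name is caught, the date is missed.
tokens = ["Mr.", "Smith", "was", "seen", "on", "2020-04-01", "."]
gold = [False, True, False, False, False, True, False]
pred = [False, True, False, False, False, False, False]

score = phi_sensitivity(gold, pred)
print(f"sensitivity = {score:.2f}")
```

In this made-up example one of the two identifier tokens is missed, so half of the identifying content would remain in the shared note, which is why a score below 1.0 may be treated as a failure in practice.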


