Redaction of Protected Health Information in EHRs using CRFs and Bi-directional LSTMs

2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO) Pub Date : 2018-08-01 DOI:10.1109/ICRITO.2018.8748713

Aparna Madan, Ann M. George, Apurva Singh, M. Bhatia


Citations: 2

Abstract

This paper describes the de-identification of personally identifiable information (PII) in electronic health records (EHRs) using two models: conditional random fields (CRFs) and bidirectional long short-term memory (LSTM) networks. Most medical records store private information such as PATIENT NAME, HOSPITAL NAME, and LOCATION, which must be de-identified or redacted before the records are passed on for further medical research. Removing such information with machine learning techniques begins with pre-processing the raw data through tokenization and sentence detection. Comparing the two techniques, we note that CRFs require manual feature engineering to train the model, whereas LSTMs can handle long-term dependencies without much insight into the dataset. A bidirectional LSTM network was used to generate context information from suitable word representations. Finally, a predictive layer was applied to predict the protected health information (PHI) terms having maximum probability. We evaluated both models on the i2b2 gold data set of clinical narratives from the 2014 De-identification challenge and propose an efficient solution for redaction; both models achieve good F-scores for PHI of all types. The LSTM-based model achieved a micro-F1 measure of 0.9592, outperforming the CRF-based model.
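As a reading aid for the reported micro-F1 measure, the following is a minimal sketch of how a micro-averaged F1 score can be computed for PHI detection. It is not the authors' evaluation code. The span representation `(doc_id, start, end, phi_type)`, the exact-match criterion, the function name `micro_f1`, and the toy data are all assumptions made for illustration; the abstract does not say whether scoring was done at the entity or token level.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over PHI spans.

    gold, predicted: sets of (doc_id, start, end, phi_type) tuples
    (a hypothetical representation). Counts are pooled across all
    PHI types before precision and recall are computed, which is
    what makes the average "micro".
    """
    tp = len(gold & predicted)   # spans matched exactly, type included
    fp = len(predicted - gold)   # predicted spans absent from gold
    fn = len(gold - predicted)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (invented data, not from the i2b2 corpus).
gold = {
    ("note1", 0, 10, "PATIENT"),
    ("note1", 25, 40, "HOSPITAL"),
    ("note1", 52, 58, "LOCATION"),
    ("note2", 5, 15, "PATIENT"),
}
predicted = {
    ("note1", 0, 10, "PATIENT"),
    ("note1", 25, 40, "HOSPITAL"),
    ("note1", 52, 58, "HOSPITAL"),   # right span, wrong type: counts as an error
    ("note2", 5, 15, "PATIENT"),
}

score = micro_f1(gold, predicted)
print(f"micro-F1 = {score:.4f}")
```

Because counts are pooled before averaging, frequent PHI types weigh more heavily in a micro-averaged score than rare ones; a macro average would instead weight every type equally.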

