DE-IDENTIFICATION OF PROTECTED HEALTH INFORMATION (PHI) FROM FREE TEXT IN MEDICAL RECORDS

Geetha Mahadevaiah, M. Dinesh, R. Sreenivasan, S. Moin, A. Dekker
International Journal of Security, Privacy and Trust Management. Published 2019-05-31. DOI: 10.5121/IJSPTM.2019.8201 (https://doi.org/10.5121/IJSPTM.2019.8201)

Abstract

Medical records often contain the results of clinical investigations and critical information about a patient's health. Alongside this clinical information, they can also contain Protected Health Information (PHI) such as names, locations, and dates. Under the Health Insurance Portability and Accountability Act (HIPAA), all types of PHI must be de-identified before medical records are shared with researchers and others. Manual de-identification by human annotators is laborious and error-prone, so a reliable automated de-identification system is needed. In this work, we analyzed the performance of various state-of-the-art techniques for de-identifying patient notes in electronic health records. Based on the performance reported in the literature, we selected NeuroNER to de-identify Indian radiology reports. NeuroNER is a named-entity recognition tool for text de-identification developed at the Massachusetts Institute of Technology (MIT). It is based on artificial neural networks, is written in Python, uses the TensorFlow machine-learning framework, and comes with five pre-trained models. To test the NeuroNER models on Indian-context data such as person and place names, we simulated 3300 medical records by extracting clinical findings and remarks from the MIMIC-III data set. To collect the relevant Indian data, we scraped various websites for Indian names, Indian locations (all towns and cities), and Indian hospital and unit names. During testing of the NeuroNER system, we observed that some of the Indian data, such as names and locations, were not de-identified satisfactorily. To improve the performance of NeuroNER on Indian-context data, we added a new pre-trained model, alongside the existing NeuroNER pre-trained models, to handle Indian medical reports. A medical dictionary lookup was used to reduce the number of misclassifications. The results from all four pre-trained models and the model trained on the simulated Indian data were concatenated to generate a final PHI token list, which was then used to anonymize the medical records and obtain de-identified records. With this approach, we improved the applicability of the NeuroNER system to Indian data, as well as its efficiency and reliability. Of the simulated reports, 2000 were used as the training set for transfer learning, 1000 as the test set, and 300 as the validation (unseen) set.
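
The final steps described in the abstract (concatenating the per-model results, applying a medical dictionary lookup, and anonymizing the text with the final PHI token list) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it does not call NeuroNER, and the function names, the `[PHI]` placeholder, the sample report, the sample model outputs, and the dictionary contents are all hypothetical.

```python
import re
from typing import Iterable, List, Set


def merge_phi_tokens(model_outputs: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate per-model PHI token lists, dropping case-insensitive duplicates."""
    seen: Set[str] = set()
    merged: List[str] = []
    for tokens in model_outputs:
        for token in tokens:
            key = token.lower()
            if key not in seen:
                seen.add(key)
                merged.append(token)
    return merged


def drop_medical_terms(tokens: Iterable[str], medical_terms: Set[str]) -> List[str]:
    """Remove tokens found in a medical dictionary (assumed to be lowercase)."""
    return [t for t in tokens if t.lower() not in medical_terms]


def anonymize(text: str, phi_tokens: List[str], placeholder: str = "[PHI]") -> str:
    """Replace every whole-word occurrence of a PHI token with a placeholder."""
    if not phi_tokens:
        return text
    # Longest tokens first, so multi-word tokens win over their parts.
    ordered = sorted(phi_tokens, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(placeholder, text)


# Hypothetical report and hypothetical model outputs, for illustration only.
report = (
    "Patient Ramesh Kumar from Mysuru was seen at Apollo Hospital. "
    "Impression: mild pleural effusion."
)
model_outputs = [
    ["Ramesh Kumar", "Apollo Hospital"],  # stand-in for one pre-trained model
    ["Ramesh Kumar", "effusion"],         # stand-in for a model with a false positive
    ["Mysuru", "Apollo Hospital"],        # stand-in for the Indian-data model
]
medical_terms = {"effusion", "pleural"}   # stand-in for the medical dictionary

tokens = drop_medical_terms(merge_phi_tokens(model_outputs), medical_terms)
deidentified = anonymize(report, tokens)
print(deidentified)
```

In this sketch the dictionary lookup removes the clinical term "effusion" from the merged list before masking, which is one plausible way a lookup could reduce misclassifications; the abstract does not specify how the lookup is applied.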