Aparna Madan, Ann M. George, Apurva Singh, M. Bhatia
Published in: 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2018-08-01. DOI: 10.1109/ICRITO.2018.8748713 (https://doi.org/10.1109/ICRITO.2018.8748713)
Redaction of Protected Health Information in EHRs using CRFs and Bi-directional LSTMs
This paper describes the de-identification of personally identifiable information (PII) in electronic health records (EHRs) using two models: conditional random fields (CRFs) and bidirectional long short-term memory networks (Bi-LSTMs). Most medical records store private information such as PATIENT NAME, HOSPITAL NAME, and LOCATION, which must be de-identified or redacted before the records are passed on for further medical research. The removal process begins with pre-processing of the raw data by tokenization and sentence detection. Comparing the two techniques, CRFs require manual feature engineering to train the model, whereas an LSTM can handle long-term dependencies without much insight into the dataset. A bidirectional LSTM network was used to generate context information from suitable word representations. Finally, a predictive layer was applied to predict the protected health information (PHI) terms with maximum probability. We propose an efficient solution for redaction using the two models, evaluated on the i2b2 gold-standard dataset of clinical narratives from the 2014 De-identification Challenge; both achieve good F-scores across all PHI types. The LSTM-based model achieved a micro-F1 measure of 0.9592, outperforming the CRF-based model.
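To make the pipeline stages above concrete, the following is a minimal sketch, not the authors' implementation. It shows regex tokenization, the kind of hand-crafted per-token features a CRF is typically given, and a micro-averaged F1 that pools counts over all PHI labels. The function names, the feature set, the label strings, and the token-level scoring are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def token_features(tokens, i):
    """Hand-crafted features for token i (hypothetical feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:],
        # Neighbouring tokens give the model local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def micro_f1(gold, pred, outside="O"):
    """Token-level micro-F1: pool TP/FP/FN over all PHI labels, then score."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p and g != outside:
            tp += 1
        else:
            if p != outside:
                fp += 1  # predicted a PHI label that is wrong or spurious
            if g != outside:
                fn += 1  # missed or mislabelled a gold PHI token
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = tokenize("Mr. Smith visited Mercy Hospital on 03/04/2014.")
    print(token_features(tokens, 2))
    gold = ["O", "PATIENT", "O", "HOSPITAL", "HOSPITAL"]
    pred = ["O", "PATIENT", "O", "HOSPITAL", "O"]
    print(micro_f1(gold, pred))
```

A Bi-LSTM tagger would replace `token_features` with learned word representations, which is the contrast the abstract draws between the two models. The scoring here is per token for simplicity; challenge evaluations are often computed per entity span, so figures from this sketch are not comparable to the reported 0.9592.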