A Markov Chain Replacement Strategy for Surrogate Identifiers: Minimizing Re-Identification Risk While Preserving Text Reuse
John D Osborne, Andrew Trotter, Tobias O'Leary, Chris Coffee, Micah D Cochran, Luis Mansilla-Gonzalez, Akhil Nadimpalli, Alex McAnnally, Abdulateef I Almudaifer, Jeffrey R Curtis, Salma M Aly, Richard E Kennedy
Electronics, 14(19). Published 2025-10-01 (Epub 2025-10-06). Journal Article.
DOI: https://doi.org/10.3390/electronics14193945
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12536513/pdf/
Citations: 0
Abstract
"Hiding in Plain Sight" (HIPS) strategies for Personal Health Information (PHI) replace PHI with surrogate values to hinder re-identification attempts. We evaluate three different HIPS strategies for PHI replacement, a standard Consistent replacement strategy, a Random replacement strategy, and a novel Markov model strategy. We evaluate the privacy-preserving benefits and relative utility for information extraction of these strategies on both a simulated PHI distribution and real clinical corpora from two different institutions using a range of false negative error rates (FNER). The Markov strategy consistently outperformed the Consistent and Random substitution strategies on both real data and in statistical simulations. Using FNER ranging from 0.1% to 5%, PHI leakage at the document level could be reduced from 27.1% to 0.1% and from 94.2% to 57.7% with the Markov strategy versus the standard Consistent substitution strategy, at 0.1% and 0.5% FNER, respectively. Additionally, we assessed the generated corpora containing synthetic PHI for reuse using a variety of information extraction methods. Results indicate that modern deep learning methods have similar performance on all strategies, but older machine learning techniques can suffer from the change in context. Overall, a Markov surrogate generation strategy substantially reduces the chance of inadvertent PHI release.
Journal: Electronics (Computer Science: Computer Networks and Communications)
CiteScore: 1.10
Self-citation rate: 10.30%
Articles published: 3515
Review time: 16.71 days
About the journal:
Electronics (ISSN 2079-9292; CODEN: ELECGJ) is an international, open access journal on the science of electronics and its applications published quarterly online by MDPI.