A Markov Chain Replacement Strategy for Surrogate Identifiers: Minimizing Re-Identification Risk While Preserving Text Reuse.

IF 2.6 · Region 3 (Engineering & Technology) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS
Electronics Pub Date : 2025-10-01 Epub Date: 2025-10-06 DOI:10.3390/electronics14193945
John D Osborne, Andrew Trotter, Tobias O'Leary, Chris Coffee, Micah D Cochran, Luis Mansilla-Gonzalez, Akhil Nadimpalli, Alex McAnnally, Abdulateef I Almudaifer, Jeffrey R Curtis, Salma M Aly, Richard E Kennedy
Electronics, volume 14, issue 19 (Journal Article). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12536513/pdf/
Citations: 0

Abstract

"Hiding in Plain Sight" (HIPS) strategies for Personal Health Information (PHI) replace PHI with surrogate values to hinder re-identification attempts. We evaluate three HIPS strategies for PHI replacement: a standard Consistent replacement strategy, a Random replacement strategy, and a novel Markov model strategy. We evaluate the privacy-preserving benefits of these strategies, and their relative utility for information extraction, on both a simulated PHI distribution and real clinical corpora from two different institutions, using a range of false negative error rates (FNER). The Markov strategy consistently outperformed the Consistent and Random substitution strategies on both real data and in statistical simulations. FNER was varied from 0.1% to 5%. Relative to the standard Consistent substitution strategy, the Markov strategy reduced document-level PHI leakage from 27.1% to 0.1% at 0.1% FNER and from 94.2% to 57.7% at 0.5% FNER. Additionally, we assessed the generated corpora containing synthetic PHI for reuse using a variety of information extraction methods. Results indicate that modern deep learning methods perform similarly on all strategies, but older machine learning techniques can suffer from the change in context. Overall, a Markov surrogate generation strategy substantially reduces the chance of inadvertent PHI release.
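To make the difference between the three strategies concrete, the following is a minimal Python sketch of how surrogates might be assigned to the PHI mentions of a single document. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the Markov model's states or transition probabilities, so the `markov` function, its `p_stay` parameter, the function names, and the surrogate pool are all hypothetical.

```python
import random
from typing import Dict, List


def consistent(mentions: List[str], pool: List[str], rng: random.Random) -> List[str]:
    """Consistent: every occurrence of a real value gets the same surrogate."""
    mapping: Dict[str, str] = {}
    out = []
    for m in mentions:
        if m not in mapping:
            mapping[m] = rng.choice(pool)
        out.append(mapping[m])
    return out


def random_replacement(mentions: List[str], pool: List[str], rng: random.Random) -> List[str]:
    """Random: every occurrence gets an independently drawn surrogate."""
    return [rng.choice(pool) for _ in mentions]


def markov(mentions: List[str], pool: List[str], rng: random.Random,
           p_stay: float = 0.8) -> List[str]:
    """Hypothetical Markov variant (an assumption, not the paper's model):
    keep the current surrogate for a value with probability p_stay,
    otherwise draw a fresh surrogate from the pool."""
    current: Dict[str, str] = {}
    out = []
    for m in mentions:
        if m not in current or rng.random() >= p_stay:
            current[m] = rng.choice(pool)
        out.append(current[m])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up mentions and surrogate pool, for illustration only.
    mentions = ["John Smith", "Jane Doe", "John Smith", "John Smith", "Jane Doe"]
    pool = ["Alex Reed", "Sam Cole", "Pat Lane", "Kim West"]
    print("consistent:", consistent(mentions, pool, rng))
    print("random:    ", random_replacement(mentions, pool, rng))
    print("markov:    ", markov(mentions, pool, rng))
```

In this sketch, `p_stay=1.0` reduces the hypothetical chain to the Consistent strategy and `p_stay=0.0` reduces it to the Random strategy, so intermediate values interpolate between the two. Whether the published method behaves this way is not stated in the abstract.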

Source journal
Electronics
Electronics Computer Science-Computer Networks and Communications
CiteScore: 1.10
Self-citation rate: 10.30%
Articles published: 3515
Review time: 16.71 days
Journal description: Electronics (ISSN 2079-9292; CODEN: ELECGJ) is an international, open access journal on the science of electronics and its applications published quarterly online by MDPI.