Ten quick tips for protecting health data using de-identification and perturbation of structured datasets.

IF 3.6 | Zone 2 (Biology) | JCR Q1 | Biochemical Research Methods

PLoS Computational Biology Pub Date : 2025-09-23 eCollection Date: 2025-09-01 DOI:10.1371/journal.pcbi.1013507

Tshikala Eddie Lulamba, Themba Mutemaringa, Nicki Tiffin

PLoS Computational Biology 21(9): e1013507. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456793/pdf/

Citations: 0

Abstract

Structured patient data generated within the health data ecosystem are shared both internally for operational use and externally for research and public health benefit. Protecting individual privacy and health data confidentiality in these contexts relies on data de-identification and anonymisation, although there are no universally accepted standards for these processes and the techniques involved can be technically complex. We present practical recommendations grounded in the principle of data minimisation: avoiding unnecessary granularity and identifying variables that could lead to re-identification when combined with other datasets. We provide practical guidance for anonymising and perturbing structured health data in ways that support compliance with data protection laws, describing technical and operational methods for reducing re-identification risk. These methods include rounding numerical values, replacing precise values with ranges, adding jitter to numeric fields, aggregating data, managing date values, and separating sensitive fields from identifying data to prevent linkage that leads to re-identification. While some methods require advanced technical knowledge, we focus here on accessible strategies that can be implemented without specialist expertise, recognising the importance of the legal and governance frameworks in which anonymisation occurs. These guidelines support researchers, data managers and institutions in sharing health data responsibly, maintaining data utility while upholding privacy and promoting ethical and legal data stewardship for data-driven health research.
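
As an illustration only, several of the methods named in the abstract could be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, field names, bin widths, jitter spread and the per-patient date offset are all hypothetical choices made for this example, and uniform noise and date shifting are just one possible reading of "adding jitter" and "managing date values".

```python
import random
from datetime import date, timedelta


def round_to(value, base=5):
    """Round a numeric value to the nearest multiple of `base`."""
    return base * round(value / base)


def to_range(value, width=10):
    """Replace a precise value with a range label such as '30-39'."""
    low = (int(value) // width) * width
    return f"{low}-{low + width - 1}"


def jitter(value, spread, rng):
    """Add uniform random noise in [-spread, +spread] (an assumed noise model)."""
    return value + rng.uniform(-spread, spread)


def shift_date(d, offset_days):
    """Shift a date by a fixed offset; using one offset per patient keeps
    the intervals between that patient's dates unchanged."""
    return d + timedelta(days=offset_days)


def split_record(record, identifying_fields, study_id):
    """Separate identifying fields from the remaining fields.

    Both parts carry only a study ID, so they can be stored apart and
    re-linked solely by whoever holds both tables.
    """
    identifiers = {k: v for k, v in record.items() if k in identifying_fields}
    shared = {k: v for k, v in record.items() if k not in identifying_fields}
    identifiers["study_id"] = study_id
    shared["study_id"] = study_id
    return identifiers, shared


# Usage on a made-up record (all values are fictional).
rng = random.Random(42)  # seeded only to make the demo repeatable
record = {
    "name": "A. Patient",
    "national_id": "0000000000",
    "age": 34,
    "weight_kg": 67.4,
    "systolic_bp": 123,
    "admitted": date(2024, 3, 14),
}

identifiers, shared = split_record(record, {"name", "national_id"}, "S0001")
shared["age"] = to_range(shared["age"], width=10)           # '30-39'
shared["systolic_bp"] = round_to(shared["systolic_bp"], 5)  # nearest 5 mmHg
shared["weight_kg"] = round(jitter(shared["weight_kg"], 0.5, rng), 1)
patient_offset = rng.randint(-30, 30)  # hypothetical per-patient offset
shared["admitted"] = shift_date(shared["admitted"], patient_offset)

print(shared)
```

In this sketch the `identifiers` table and any per-patient date offsets would need to be stored separately and under stricter access control than the `shared` table, since either one could be used to reverse the protection. How coarse the rounding, ranges and jitter should be depends on the dataset and on the legal and governance context, which the code does not address.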

Source journal

PLoS Computational Biology (Biochemical Research Methods; Mathematical & Computational Biology)

CiteScore: 7.10
Self-citation rate: 4.70%
Articles published: 820
Review time: 2.5 months

About the journal: PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales, from molecules and cells to patient populations and ecosystems, through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, the reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise, to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.