Johannes Hauswaldt, Roland Groh, Knut Kaulke, Falk Schlegelmilch, Alireza Zarei, Eva Hummers
{"title":"[Anonymization of general practitioners' electronic medical records in two research datasets].","authors":"Johannes Hauswaldt, Roland Groh, Knut Kaulke, Falk Schlegelmilch, Alireza Zarei, Eva Hummers","doi":"10.1055/a-2624-0084","DOIUrl":null,"url":null,"abstract":"<p><p>A dataset can be called \"anonymous\" only if its content cannot be related to a person, not by any means and not even <i>ex post</i> or by combination with other information. Free text entries highly impede \"factual anonymization\" for secondary research. Using two source datasets from GPs' electronic medical records (EMR), we aimed at de-identification in an iterative and systematic search for potentially identifying field content (PIF).EMR data of 14,285 to 100 GP patients with 40 variables (parameters, fields) in 5,918,321 resp. 363,084 data lines were analyzed at four levels: field labels, their combination, field content, dataset as a whole. Field labels were arranged into eleven semantic groups according to field type, their frequencies examined and their combination evaluated by GP experts rating the re-identification risk. Iteratively we searched for free text PIFs and masked them for the subsequent steps. The ratio of PIF data lines' number over total number yielded final probability estimators. In addition, we processed a whole dataset using ARX open source software for anonymizing sensitive personal data. Results were evaluated in a data protection impact assessment according to article 35 GDPR, with respect to the severity of privacy breach and to its estimated probability.We found a high risk of re-identification with free text entries into \"history\", \"current diagnosis\", \"medication\" and \"findings\" even after repeated algorithmic text-mining and natural language processing. 
Scrupulous pre-selection of variables, data parsimony, privacy by design in data processing and measures described here may reduce the risk considerably, but will not result in a \"factually anonymized\" research dataset.To identify and assess re-identifying field content is mandatory for privacy protection but anonymization can be reached only partly by reasonable efforts. Semantic structuring of data is pre-conditional but does not help with erroneous entries.</p>","PeriodicalId":47653,"journal":{"name":"Gesundheitswesen","volume":" ","pages":""},"PeriodicalIF":0.7000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Gesundheitswesen","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2624-0084","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0
Abstract
A dataset can be called "anonymous" only if its content cannot be related to a person by any means, not even ex post or in combination with other information. Free-text entries severely impede "factual anonymization" for secondary research. Using two source datasets from GPs' electronic medical records (EMR), we pursued de-identification through an iterative, systematic search for potentially identifying field content (PIF).

EMR data of 14,285 and 100 GP patients, respectively, with 40 variables (parameters, fields) in 5,918,321 resp. 363,084 data lines, were analyzed at four levels: field labels, their combinations, field content, and the dataset as a whole. Field labels were arranged into eleven semantic groups according to field type; their frequencies were examined and their combinations evaluated by GP experts rating the re-identification risk. We iteratively searched for free-text PIFs and masked them for the subsequent steps. The ratio of the number of PIF data lines to the total number of data lines yielded the final probability estimators. In addition, we processed a whole dataset using ARX, an open-source software for anonymizing sensitive personal data. Results were evaluated in a data protection impact assessment according to Article 35 GDPR, with respect to the severity of a privacy breach and its estimated probability.

We found a high risk of re-identification with free-text entries in "history", "current diagnosis", "medication" and "findings", even after repeated algorithmic text mining and natural language processing. Scrupulous pre-selection of variables, data parsimony, privacy by design in data processing, and the measures described here may reduce the risk considerably, but will not result in a "factually anonymized" research dataset.

Identifying and assessing re-identifying field content is mandatory for privacy protection, but anonymization can be achieved only partly by reasonable efforts. Semantic structuring of data is a precondition, but it does not help with erroneous entries.
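The masking-and-ratio procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the PIF patterns below (title-prefixed names, German-style dates, phone-like digit runs) are assumptions for demonstration, and the `[PIF]` mask token is likewise hypothetical. The ratio of PIF-containing lines to all lines mirrors the paper's final probability estimator.

```python
import re

# Illustrative patterns for potentially identifying field content (PIF).
# The study's actual pattern set is not published here; these are assumptions.
PIF_PATTERNS = [
    re.compile(r"\b(?:Dr\.|Prof\.)\s+\w+"),        # names following a title
    re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{2,4}\b"),  # German-style dates, e.g. 12.03.1957
    re.compile(r"\b\d{3,}[\s/-]?\d{3,}\b"),        # phone-like digit runs
]

def mask_pif(text: str, token: str = "[PIF]") -> tuple[str, bool]:
    """Mask every PIF pattern match; report whether the line contained any."""
    hit = False
    for pat in PIF_PATTERNS:
        if pat.search(text):
            hit = True
            text = pat.sub(token, text)
    return text, hit

def pif_ratio(lines: list[str]) -> float:
    """Ratio of PIF data lines over all data lines: a crude probability estimator."""
    hits = sum(mask_pif(line)[1] for line in lines)
    return hits / len(lines) if lines else 0.0

# Invented example free-text entries, in the spirit of "history"/"findings" fields.
lines = [
    "Follow-up with Dr. Meier on 12.03.2021",
    "BP 120/80, no complaints",
    "Call 0551 123456 to reschedule",
]
print(round(pif_ratio(lines), 3))  # two of three lines carry PIF
```

In the study this search was iterative: after each masking pass, the remaining text was re-examined for new PIF candidates, which a regex-only sketch like this cannot fully capture; the paper notes that even repeated text mining and NLP left a high residual re-identification risk.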
Journal description:
Das Gesundheitswesen provides comprehensive, up-to-date coverage of the most important topics in the health care system. In addition to guidelines, overviews and comments, it publishes current research results and contributions to CME-certified continuing education and training. The journal offers a scientific discussion forum and a platform for communications from professional societies. Content quality is ensured by the publishers' board, the expert advisory board and other experts in the peer review process.