Johannes Hauswaldt, Roland Groh, Knut Kaulke, Falk Schlegelmilch, Alireza Zarei, Eva Hummers
{"title":"[Anonymization of general practitioners' electronic medical records in two research datasets].","authors":"Johannes Hauswaldt, Roland Groh, Knut Kaulke, Falk Schlegelmilch, Alireza Zarei, Eva Hummers","doi":"10.1055/a-2624-0084","DOIUrl":null,"url":null,"abstract":"<p><p>A dataset can be called \"anonymous\" only if its content cannot be related to a person, not by any means and not even <i>ex post</i> or by combination with other information. Free text entries highly impede \"factual anonymization\" for secondary research. Using two source datasets from GPs' electronic medical records (EMR), we aimed at de-identification in an iterative and systematic search for potentially identifying field content (PIF).EMR data of 14,285 to 100 GP patients with 40 variables (parameters, fields) in 5,918,321 resp. 363,084 data lines were analyzed at four levels: field labels, their combination, field content, dataset as a whole. Field labels were arranged into eleven semantic groups according to field type, their frequencies examined and their combination evaluated by GP experts rating the re-identification risk. Iteratively we searched for free text PIFs and masked them for the subsequent steps. The ratio of PIF data lines' number over total number yielded final probability estimators. In addition, we processed a whole dataset using ARX open source software for anonymizing sensitive personal data. Results were evaluated in a data protection impact assessment according to article 35 GDPR, with respect to the severity of privacy breach and to its estimated probability.We found a high risk of re-identification with free text entries into \"history\", \"current diagnosis\", \"medication\" and \"findings\" even after repeated algorithmic text-mining and natural language processing. 
Scrupulous pre-selection of variables, data parsimony, privacy by design in data processing and measures described here may reduce the risk considerably, but will not result in a \"factually anonymized\" research dataset.To identify and assess re-identifying field content is mandatory for privacy protection but anonymization can be reached only partly by reasonable efforts. Semantic structuring of data is pre-conditional but does not help with erroneous entries.</p>","PeriodicalId":47653,"journal":{"name":"Gesundheitswesen","volume":" ","pages":""},"PeriodicalIF":0.7000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Gesundheitswesen","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2624-0084","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0
Abstract
A dataset can be called "anonymous" only if its content cannot be related to a person by any means, not even ex post or in combination with other information. Free-text entries severely impede "factual anonymization" for secondary research. Using two source datasets from GPs' electronic medical records (EMR), we pursued de-identification through an iterative, systematic search for potentially identifying field content (PIF).

EMR data of 14,285 and 100 GP patients, respectively, with 40 variables (parameters, fields) in 5,918,321 resp. 363,084 data lines, were analyzed at four levels: field labels, their combinations, field content, and the dataset as a whole. Field labels were arranged into eleven semantic groups according to field type; their frequencies were examined and their combinations evaluated by GP experts rating the re-identification risk. We iteratively searched for free-text PIFs and masked them for the subsequent steps. The ratio of the number of PIF data lines to the total number of data lines yielded the final probability estimators. In addition, we processed a whole dataset using ARX, an open-source software for anonymizing sensitive personal data. Results were evaluated in a data protection impact assessment according to Article 35 GDPR, with respect to the severity of a privacy breach and its estimated probability.

We found a high risk of re-identification with free-text entries in "history", "current diagnosis", "medication" and "findings", even after repeated algorithmic text mining and natural language processing. Scrupulous pre-selection of variables, data parsimony, privacy by design in data processing, and the measures described here may reduce the risk considerably, but will not result in a "factually anonymized" research dataset.

Identifying and assessing re-identifying field content is mandatory for privacy protection, but anonymization can be achieved only partly by reasonable efforts. Semantic structuring of data is a precondition, but it does not help with erroneous entries.
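The masking-and-ratio procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the PIF patterns below (title-prefixed names, German-style dates, phone-like digit runs) are assumptions for demonstration, and the `[PIF]` mask token is likewise hypothetical. The ratio of PIF-containing lines to all lines mirrors the paper's final probability estimator.

```python
import re

# Illustrative patterns for potentially identifying field content (PIF).
# The study's actual pattern set is not published here; these are assumptions.
PIF_PATTERNS = [
    re.compile(r"\b(?:Dr\.|Prof\.)\s+\w+"),        # names following a title
    re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{2,4}\b"),  # German-style dates, e.g. 12.03.1957
    re.compile(r"\b\d{3,}[\s/-]?\d{3,}\b"),        # phone-like digit runs
]

def mask_pif(text: str, token: str = "[PIF]") -> tuple[str, bool]:
    """Mask every PIF pattern match; report whether the line contained any."""
    hit = False
    for pat in PIF_PATTERNS:
        if pat.search(text):
            hit = True
            text = pat.sub(token, text)
    return text, hit

def pif_ratio(lines: list[str]) -> float:
    """Ratio of PIF data lines over all data lines: a crude probability estimator."""
    hits = sum(mask_pif(line)[1] for line in lines)
    return hits / len(lines) if lines else 0.0

# Invented example free-text entries, in the spirit of "history"/"findings" fields.
lines = [
    "Follow-up with Dr. Meier on 12.03.2021",
    "BP 120/80, no complaints",
    "Call 0551 123456 to reschedule",
]
print(round(pif_ratio(lines), 3))  # two of three lines carry PIF
```

In the study this search was iterative: after each masking pass, the remaining text was re-examined for new PIF candidates, which a regex-only sketch like this cannot fully capture; the paper notes that even repeated text mining and NLP left a high residual re-identification risk.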
Journal description:
Das Gesundheitswesen provides comprehensive, up-to-date coverage of the most important topics in the health care system. In addition to guidelines, overviews and comments, it publishes current research results and contributions to CME-certified continuing education and training. The journal offers a scientific discussion forum and a platform for communications from professional societies. Content quality is ensured by the publishers' board, the expert advisory board and other experts in the peer review process.