Siamese neural network-enhanced electrocardiography can re-identify anonymized healthcare data
Krzysztof Macierzanka, Arunashis Sau, Konstantinos Patlatzoglou, Libor Pastika, Ewa Sieliwonczyk, Mehak Gurnani, Nicholas S Peters, Jonathan W Waks, Daniel B Kramer, Fu Siong Ng
European Heart Journal. Digital Health 6(3):417-426. Published 2025-02-25. DOI: 10.1093/ehjdh/ztaf011 (https://doi.org/10.1093/ehjdh/ztaf011). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12088719/pdf/
Abstract
Aims: Many research databases with anonymized patient data contain electrocardiograms (ECGs) from which traditional identifiers have been removed. We evaluated the ability of artificial intelligence (AI) methods to determine the similarity between ECGs and assessed whether they have the potential to be misused to re-identify individuals from anonymized datasets.
Methods and results: We used a convolutional Siamese neural network (SNN) architecture, which derives a Euclidean distance similarity metric between two input ECGs. A secondary care dataset of 864 283 ECGs (72 455 subjects) was used. The Siamese neural network-electrocardiogram (SNN-ECG) model achieves an accuracy of 91.68% when classifying between 2 689 124 same-subject pairs and 2 689 124 different-subject pairs. This performance increases to 93.61% and 95.97% in the outpatient and normal ECG subsets, respectively. In a simulated 'motivated intruder' test, SNN-ECG can identify individuals from large datasets. In datasets of 100, 1000, 10 000, and 20 000 ECGs, each containing only one ECG from the reference individual, it achieves success rates of 79.2%, 62.6%, 45.0%, and 40.0%, respectively. Random selection would succeed in 1%, 0.1%, 0.01%, and 0.005% of cases, respectively. Additional basic information, such as subject sex or age range, enhances performance further. We also found that, at the subject level, ECG pair similarity is clinically relevant; greater ECG dissimilarity is associated with all-cause mortality [hazard ratio, 1.22 (1.21-1.23), P < 0.0001] and is additive to an AI-ECG model trained for mortality prediction.
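The distance-based matching and the random baselines described above can be illustrated with a minimal sketch. It does not reproduce the SNN-ECG model: the embeddings are synthetic 8-dimensional vectors standing in for network outputs, and the function names, the threshold value, and the toy dataset are all hypothetical choices made for illustration.

```python
import math
import random


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_subject(a, b, threshold):
    """Label a pair as same-subject when its distance is below the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return euclidean_distance(a, b) < threshold


def identify(reference, candidates):
    """Return the index of the candidate embedding closest to the reference."""
    distances = [euclidean_distance(reference, c) for c in candidates]
    return min(range(len(candidates)), key=distances.__getitem__)


def random_success_rate(n_candidates):
    """Chance of picking the single matching ECG by uniform random guessing."""
    return 1.0 / n_candidates


# Toy 'motivated intruder' setup with synthetic embeddings (not real ECG data).
rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(8)]
candidates = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(99)]

# Insert one slightly perturbed copy of the reference to play the role of
# the single same-subject ECG hidden among 100 candidates.
match_index = 42
candidates.insert(match_index, [x + rng.gauss(0, 0.01) for x in reference])

found = identify(reference, candidates)

# Random-guess baselines for the dataset sizes quoted in the abstract.
baselines = {n: random_success_rate(n) for n in (100, 1000, 10_000, 20_000)}
```

In this toy setting the nearest-neighbour search recovers the planted match because its distance to the reference is far smaller than that of any unrelated vector. The baselines are simply 1/N, which gives the 1%, 0.1%, 0.01%, and 0.005% figures quoted for random selection.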
Conclusion: Anonymized ECGs retain information that may facilitate subject re-identification, raising privacy and data protection concerns. However, SNN-ECG models also have positive uses and can enhance risk prediction of cardiovascular disease.