Linking patient records at scale with a hybrid approach combining contrastive learning and deterministic rules.

Journal impact factor: 1.3 · JCR quartile: Q3 · Category: Biochemical Research Methods

Biology Methods and Protocols · Pub Date: 2026-02-09 · eCollection Date: 2026-01-01 · DOI: 10.1093/biomethods/bpag009

Cheng Cao, Jay Pillai, Sara Daraei, Sina Ghadermarzi

Biology Methods and Protocols, volume 11, issue 1, article bpag009. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12952525/pdf/


Abstract

Linking patient records across disparate healthcare systems is essential to create comprehensive views of patient health, yet this task is complicated by inconsistent identifiers and data quality issues. Although traditional deterministic and probabilistic record linkage (RL) methods have long been used for this purpose, deterministic approaches are brittle in the presence of noisy personally identifiable information (PII), while probabilistic approaches are often difficult to scale. As a result, large-scale linkage commonly relies on restrictive matching strategies that limit recall. This work presents a hybrid RL approach that integrates a deep embedding model with deterministic rules, leveraging both the flexibility and noise robustness of soft embeddings and the reliability and predictable baseline performance of deterministic rules. Using a large-scale real-world dataset, a BERT-based embedding model is fine-tuned in a Siamese network with contrastive loss to encode PII fields as numeric vectors. De-duplicated identifiers (Fuzzy IDs) are then obtained through a blocking-and-clustering step using the embedding vectors. The approach is evaluated using multiple signals (social security number, phone, and email) and is shown to outperform baseline methods. A postprocessing step based on deterministic rules allows embedding-based linkage to be overridden in a subset of cases where high-confidence rules apply, such as when a high-quality identifier is available. The system is deployed on a commercial database consisting of more than 200 million PII records, demonstrating scalability in a real-world healthcare setting.
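
The pipeline described in the abstract (contrastive training objective, blocking and clustering on embedding vectors, deterministic override) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not specify the exact loss form, similarity measure, threshold, blocking key, or clustering algorithm, so the Euclidean contrastive loss with a margin, the cosine threshold of 0.9, the `block` field, and the union-find clustering below are all assumptions. The hand-written two-dimensional vectors stand in for the output of the fine-tuned BERT encoder, and the SSN rule is applied inline per pair rather than as a separate postprocessing pass.

```python
import math
from collections import defaultdict


def contrastive_loss(u, v, is_match, margin=1.0):
    """Pairwise contrastive loss on Euclidean distance (assumed form and margin)."""
    d = math.dist(u, v)
    # Matching pairs are pulled together; non-matching pairs are pushed
    # apart until they are at least `margin` away from each other.
    return d * d if is_match else max(0.0, margin - d) ** 2


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def assign_fuzzy_ids(records, threshold=0.9):
    """Return one cluster label per record (a stand-in for the paper's Fuzzy IDs).

    Each record is a dict with hypothetical keys:
      block: blocking key, so only records sharing it are compared
      vec:   embedding vector of the PII fields
      ssn:   optional high-quality identifier used by the deterministic rule
    """
    parent = list(range(len(records)))
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec["block"]].append(idx)

    for members in blocks.values():
        for pos, a in enumerate(members):
            for b in members[pos + 1:]:
                ssn_a, ssn_b = records[a].get("ssn"), records[b].get("ssn")
                if ssn_a and ssn_b:
                    # Deterministic rule: when both SSNs are present, their
                    # agreement or disagreement overrides the embedding score.
                    link = ssn_a == ssn_b
                else:
                    link = cosine(records[a]["vec"], records[b]["vec"]) >= threshold
                if link:
                    parent[find(parent, a)] = find(parent, b)

    return [find(parent, i) for i in range(len(records))]


# Toy data: vectors and identifiers are invented for illustration only.
records = [
    {"block": "a", "vec": [1.0, 0.0], "ssn": "111"},
    {"block": "a", "vec": [0.0, 1.0], "ssn": "111"},    # dissimilar vector, same SSN
    {"block": "a", "vec": [1.0, 0.01], "ssn": "999"},   # similar vector, different SSN
    {"block": "b", "vec": [0.6, 0.8], "ssn": None},
    {"block": "b", "vec": [0.61, 0.79], "ssn": None},   # near-duplicate of the previous record
    {"block": "b", "vec": [1.0, 0.0], "ssn": None},
]
fuzzy_ids = assign_fuzzy_ids(records)
print(fuzzy_ids)
```

In this toy data, the first two records are linked by the SSN rule even though their vectors differ, the third is kept apart from the first despite a near-identical vector, and the fourth and fifth are linked by embedding similarity alone. One limitation of this simplification is that union-find takes the transitive closure of links, so two records with conflicting SSNs could still end up together through a third record that lacks an SSN; a production system would need to handle that case explicitly.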
