Exploring the Complexity of Real-World Health Data Record Linkage: An Exemplary Study Linking Cancer Registry and Claims Data

IF 2.4, Zone 4 (Medicine), JCR Q3, PHARMACOLOGY & PHARMACY

Pharmacoepidemiology and Drug Safety. Pub Date: 2025-04-01. DOI: 10.1002/pds.70120

Nadja Lendle, Bianca Kollhorst, Timm Intemann

Volume 34, Issue 4, e70120. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11934838/pdf/

Citations: 0

Abstract



Purpose: Record linkage based on quasi-identifiers remains an important approach as not every data source provides a comprehensive unique identifier. In this study, the reasons for the failure of a linkage based on quasi-identifiers were examined. Furthermore, informed algorithms using information on gold standard links were developed to investigate the potentially achievable linkage quality based on quasi-identifiers.

Methods: The study population includes patients from an antidiabetic cohort from German claims and colorectal cancer patients from two German cancer registries. Linkage algorithms were applied using information on gold standard links. Informed linkage algorithms based on deterministic linkage, logistic regression, random forests, gradient boosting, and neural networks were derived and compared. Descriptive analyses were performed to identify reasons for the failure of linkage, such as discrepancies between data sources.
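As a rough illustration of the simplest method in that list, deterministic linkage on quasi-identifiers can be sketched as an exact-match join that accepts only one-to-one key matches. The field names (`birth_year`, `sex`, `region`, `dx_year`, `dx_quarter`), the record IDs, and the toy records are hypothetical; the abstract does not describe the authors' algorithm in this detail, so this is not their implementation.

```python
from collections import defaultdict

# Hypothetical quasi-identifier fields, loosely following those named in the abstract.
KEYS = ("birth_year", "sex", "region", "dx_year", "dx_quarter")


def quasi_key(record):
    """Build the tuple of quasi-identifier values used for exact matching."""
    return tuple(record[k] for k in KEYS)


def deterministic_link(claims, registry):
    """Return (claims_id, registry_id) pairs whose key is unique in both sources."""
    by_key_claims = defaultdict(list)
    by_key_registry = defaultdict(list)
    for rec in claims:
        by_key_claims[quasi_key(rec)].append(rec["id"])
    for rec in registry:
        by_key_registry[quasi_key(rec)].append(rec["id"])

    links = []
    for key, claim_ids in by_key_claims.items():
        registry_ids = by_key_registry.get(key, [])
        # Accept only unambiguous one-to-one matches; ambiguous keys stay unlinked.
        if len(claim_ids) == 1 and len(registry_ids) == 1:
            links.append((claim_ids[0], registry_ids[0]))
    return links


# Toy data, made up for illustration only.
claims = [
    {"id": "c1", "birth_year": 1950, "sex": "F", "region": "A", "dx_year": 2010, "dx_quarter": 2},
    {"id": "c2", "birth_year": 1948, "sex": "M", "region": "B", "dx_year": 2011, "dx_quarter": 1},
]
registry = [
    {"id": "r1", "birth_year": 1950, "sex": "F", "region": "A", "dx_year": 2010, "dx_quarter": 2},
    {"id": "r2", "birth_year": 1948, "sex": "M", "region": "B", "dx_year": 2011, "dx_quarter": 1},
    {"id": "r3", "birth_year": 1948, "sex": "M", "region": "B", "dx_year": 2011, "dx_quarter": 1},
]

links = deterministic_link(claims, registry)
print(links)
```

In this toy case `c2` stays unlinked because two registry records share its key, which is the kind of ambiguity the descriptive analyses are meant to surface.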

Results: A gradient boosting-based linkage approach performed best, achieving a precision (positive predictive value) of 77%, a recall (sensitivity) of 81%, and an F*-measure (combining precision and recall) of 64%. Of 641 patients in GePaRD, 8% were not uniquely identifiable using birth year, sex, area of residence, and year and quarter of diagnosis, whereas 33% of 42 817 cancer registry patients were not uniquely identifiable with these quasi-identifiers.
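The three reported metrics can be related with a little arithmetic. The sketch below assumes the F*-measure is the transformation F* = F1 / (2 - F1), equivalently PR / (P + R - PR); the abstract itself only says that it combines precision and recall. Recomputing from the rounded 77% and 81% gives a value near 0.65, close to but not identical with the reported 64%; the abstract does not explain the small difference.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def f_star(precision, recall):
    """Assumed definition: F* = P*R / (P + R - P*R), which equals F1 / (2 - F1)."""
    return precision * recall / (precision + recall - precision * recall)


p, r = 0.77, 0.81  # rounded values from the Results paragraph
f1 = f1_score(p, r)
fs = f_star(p, r)
print(f"F1 = {f1:.3f}, F* = {fs:.3f}")
```

Under this definition F* is the share of correct links among all records that are either linked by the algorithm or true links, which is why it is always lower than both precision and recall.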

Conclusions: Linkage of German claims and cancer registry data based on quasi-identifiers does result in insufficient linkage quality since subjects cannot be uniquely identified. It is advisable to use unique identifiers from a subsample, if available, to derive informed linkage algorithms for the entire sample. In this case, the machine learning technique gradient boosting has been found to outperform other methods.
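The conclusion rests on how often a combination of quasi-identifiers singles out one person. A minimal sketch of that check counts the share of records whose key combination occurs more than once. The field names and records are hypothetical and made up for illustration; they are not GePaRD or cancer registry data.

```python
from collections import Counter


def share_not_unique(records, keys):
    """Fraction of records whose quasi-identifier combination is shared with another record."""
    combos = [tuple(rec[k] for k in keys) for rec in records]
    counts = Counter(combos)
    not_unique = sum(1 for combo in combos if counts[combo] > 1)
    return not_unique / len(records) if records else 0.0


# Hypothetical quasi-identifier fields and toy records.
keys = ("birth_year", "sex", "region", "dx_year", "dx_quarter")
records = [
    {"birth_year": 1950, "sex": "F", "region": "A", "dx_year": 2010, "dx_quarter": 2},
    {"birth_year": 1948, "sex": "M", "region": "B", "dx_year": 2011, "dx_quarter": 1},
    {"birth_year": 1948, "sex": "M", "region": "B", "dx_year": 2011, "dx_quarter": 1},
    {"birth_year": 1939, "sex": "F", "region": "C", "dx_year": 2009, "dx_quarter": 4},
]

share = share_not_unique(records, keys)
print(f"{share:.0%} of records are not uniquely identifiable")
```

Running the same function on each data source separately would give figures comparable in kind to the 8% and 33% quoted in the Results.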


Source journal

Pharmacoepidemiology and Drug Safety (Medicine, Pharmacy)

CiteScore: 4.80
Self-citation rate: 7.70%
Articles published: 173
Review time: 3 months

Journal description: The aim of Pharmacoepidemiology and Drug Safety is to provide an international forum for the communication and evaluation of data, methods and opinion in the discipline of pharmacoepidemiology. The Journal publishes peer-reviewed reports of original research, invited reviews and a variety of guest editorials and commentaries embracing scientific, medical, statistical, legal and economic aspects of pharmacoepidemiology and post-marketing surveillance of drug safety. Appropriate material in these categories may also be considered for publication as a Brief Report.

Particular areas of interest include: design, analysis, results, and interpretation of studies looking at the benefit or safety of specific pharmaceuticals, biologics, or medical devices, including studies in pharmacovigilance, postmarketing surveillance, pharmacoeconomics, patient safety, molecular pharmacoepidemiology, or any other study within the broad field of pharmacoepidemiology; comparative effectiveness research relating to pharmaceuticals, biologics, and medical devices (comparative effectiveness research is the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat, and monitor a clinical condition, as these methods are truly used in the real world); methodologic contributions of relevance to pharmacoepidemiology, whether original contributions, reviews of existing methods, or tutorials for how to apply the methods of pharmacoepidemiology; assessments of harm versus benefit in drug therapy; patterns of drug utilization; relationships between pharmacoepidemiology and the formulation and interpretation of regulatory guidelines; evaluations of risk management plans and programmes relating to pharmaceuticals, biologics and medical devices.