Adapting historical clinical genetic test records for anonymised data linkage: obstacles and opportunities.

IF 2.2 Q3 HEALTH CARE SCIENCES & SERVICES

International Journal of Population Data Science Pub Date : 2025-02-20 eCollection Date: 2023-01-01 DOI:10.23889/ijpds.v8i5.2924

Robert T Maddison, Karen R Reed, Rebecca Cannings-John, Fiona Lugg-Widger, Thomas Stoneman, Sarah Anderson, Andrew E Fry

International Journal of Population Data Science, volume 8, issue 5, article 2924. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11922013/pdf/

Citations: 0

Abstract

Introduction: Cystic fibrosis (CF) heterozygotes (also known as 'carriers') are people who have one mutated copy of the CFTR gene. Research into the health risks of CF carriers has been limited by a lack of large cohorts tested for CF carrier status, but routine clinical testing identifies CF carriers in the population. The resulting test records also contain large amounts of clinical information, making them a valuable research resource not only for identifying CF carriers in the population but also for providing additional data not found elsewhere.

Methods: Following governance approvals, we adapted 30 years' worth of CF genetic testing records generated by the All-Wales Medical Genomics Service (AWMGS) and submitted them to the SAIL Databank for anonymised linkage.

Results: Unexpected obstacles meant that only a minimal amount of clinical information could be annotated ahead of linkage. The raw data were highly heterogeneous because of the records' longitudinal collection and clinical origins, which made standardisation difficult. Moreover, the presence of unique identifiers in the clinical data violated the separation principle, so manual annotation was required to produce a cleaned dataset. Explicit identification of patients or their relatives throughout the records complicated split-file anonymisation.

Conclusion: Extracting useful information from historical clinical genetic test records is a significant challenge with both technical and governance aspects. The mixing of unique identifiers with clinical data in heterogeneous, unstructured free text, combined with a lack of automated tools, meant that manual annotation was required to adhere to the separation principle. As a result, only a minimum of the available clinical data could be annotated within the project timeline, and mutually exclusive access to the identifiable and pseudonymised data meant that annotations could not later be validated. Future efforts to link clinical genetic test records for research must take these challenges into account.
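As an illustration of the separation principle described above, the following is a minimal Python sketch and not the authors' pipeline. It splits a record into an identifier row and a clinical row that share only a random key, then flags clinical free text that still appears to contain an identifier. The field names, the `RawRecord` type, the helper functions and the 10-digit identifier pattern are all hypothetical assumptions made for the example.

```python
import re
import secrets
from dataclasses import dataclass

# Assumed pattern: a 10-digit, NHS-number-like token, optionally grouped 3-3-4.
ID_PATTERN = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")


@dataclass
class RawRecord:
    # Hypothetical fields; real test records are far more heterogeneous.
    nhs_number: str
    name: str
    result: str
    notes: str  # unstructured free text


def split_record(rec: RawRecord):
    """Split one record into identifier and clinical rows sharing a random key."""
    key = secrets.token_hex(8)
    identifiers = {"key": key, "nhs_number": rec.nhs_number, "name": rec.name}
    clinical = {"key": key, "result": rec.result, "notes": rec.notes}
    return identifiers, clinical


def needs_manual_review(clinical: dict, name: str) -> bool:
    """Flag clinical free text that still seems to contain an identifier."""
    notes = clinical["notes"]
    has_number = bool(ID_PATTERN.search(notes))
    has_name = bool(name) and name.lower() in notes.lower()
    return has_number or has_name
```

A check this simple only catches the identifiers it is told to look for. Relatives named in free text, misspellings and other identifier formats would pass through, which is consistent with the abstract's point that manual annotation was needed.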
