Evaluating re-identification risks scores in publicly available clinical trial datasets: Insights and implications.

Impact factor 2.2 · Zone 3 (Medicine) · JCR Q3 (Medicine, Research & Experimental)

Clinical Trials Pub Date : 2025-08-22 DOI:10.1177/17407745251356423

Aryelly Rodriguez, Linda J Williams, Stephanie C Lewis, Pamela Sinclair, Sandra Eldridge, Tracy Jackson, Christopher J Weir


Abstract

Background: The motivations to share anonymised datasets from clinical trials within the scientific community are increasing. Many anonymised datasets are now publicly available for secondary research. However, it is uncertain whether they pose a privacy risk to the involved participants.

Methods: We located a broad sample of publicly available, de-identified/anonymised randomised clinical trial datasets from human participants and contacted their owners to request access, following their local procedures. We classified personal data within these datasets, including unique direct identifiers, such as date of birth, and other personal data, such as sex, age and race, that do not identify an individual on their own but may do so when combined (indirect identifiers). Combining indirect identifiers forms strata, and adding more identifiers increases granularity by dividing the data into a larger number of smaller strata. The re-identification risk score equations evaluate membership in these strata in three ways: first, by measuring the proportion of participants in strata above predetermined risk threshold levels (Ra); second, by locating the smallest stratum (Rb); third, by estimating the average membership across all strata in a dataset (Rc). The risk scores range from 0 (lowest risk) to 1 (highest risk); they do not aim to re-identify individuals in the datasets and are used for routinely collected health records. If a dataset contained a direct identifier, it automatically scored 1 in all metrics. Conversely, if a dataset contained no direct identifier and at most one indirect identifier, it automatically scored 0 in all metrics. Finally, we explored which characteristics of the datasets were associated with the risk scores and compared the risk scores and their usability.

Results: Seventy datasets from 14 data sources were analysed. Thirty-one datasets were shared with minimal restrictions (open access), while 39 were shared with varying levels of restrictions before access was granted (controlled access). Datasets had, on average, four identifiers and mean risk scores ranging from 0.47 to 0.91. The most common pieces of information present in the datasets that, when combined, may indirectly identify a participant were sex (80%) and age (72.9%).

Conclusions: This study confirms that clinical trial datasets are rich in personal details and that using re-identification risk scores as a measure of this richness is feasible. These scores could inform the anonymisation process of clinical trials datasets regarding their level of granularity prior to releasing them for secondary research. We propose a strategy for employing these scores in the decision-making process for releasing clinical trials datasets.
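The abstract describes the three risk scores only in words, so the following Python sketch is one plausible reading rather than the authors' equations. It assumes a common formulation in which each participant's risk is 1 divided by the size of their stratum. Under that assumption, Ra is the share of participants whose risk exceeds a threshold, Rb is the risk in the smallest stratum, and Rc is the average risk, which reduces to the number of strata divided by the number of participants. The function name `risk_scores`, its parameters, the default threshold and the example field names are all hypothetical.

```python
from collections import Counter


def risk_scores(records, indirect_ids, has_direct_id=False, threshold=0.2):
    """Return (Ra, Rb, Rc) for a list of dict records.

    Assumption (not stated in the abstract): a participant's risk is
    1 / (size of the stratum they belong to).
    """
    if not records:
        raise ValueError("records must not be empty")

    # Automatic rules stated in the abstract.
    if has_direct_id:
        return 1.0, 1.0, 1.0
    if len(indirect_ids) <= 1:
        return 0.0, 0.0, 0.0

    # A stratum is one distinct combination of indirect-identifier values.
    strata = Counter(tuple(r[k] for k in indirect_ids) for r in records)
    n = len(records)

    # Ra: proportion of participants in strata whose risk exceeds the threshold.
    ra = sum(size for size in strata.values() if 1 / size > threshold) / n
    # Rb: risk in the smallest stratum.
    rb = 1 / min(strata.values())
    # Rc: average per-participant risk, i.e. (number of strata) / n.
    rc = len(strata) / n
    return ra, rb, rc


# Hypothetical toy dataset with two indirect identifiers.
records = [
    {"sex": "F", "age_band": "30-39"},
    {"sex": "F", "age_band": "30-39"},
    {"sex": "F", "age_band": "30-39"},
    {"sex": "M", "age_band": "30-39"},
    {"sex": "M", "age_band": "40-49"},
    {"sex": "M", "age_band": "40-49"},
]

print(risk_scores(records, ["sex", "age_band"], threshold=0.4))
# (0.5, 1.0, 0.5)
```

In this toy example the three strata have sizes 3, 1 and 2. The single-member stratum drives Rb to 1, while adding a third identifier would split the strata further and push Ra and Rc upward, which is the granularity effect the Methods describe.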

Source journal: Clinical Trials (Medicine: Research & Experimental)

CiteScore: 4.10

Self-citation rate: 3.70%

Review time: 6-12 weeks

Journal description: Clinical Trials is dedicated to advancing knowledge on the design and conduct of clinical trials related research methodologies. Covering the design, conduct, analysis, synthesis and evaluation of key methodologies, the journal remains on the cusp of the latest topics, including ethics, regulation and policy impact.