Reidentification of Participants in Shared Clinical Data Sets: Experimental Study.

JMIR AI. Pub Date: 2024-03-15. DOI: 10.2196/52054
Daniela Wiepert, Bradley A Malin, Joseph R Duffy, Rene L Utianski, John L Stricker, David T Jones, Hugo Botha
JMIR AI, volume 3, article e52054. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11041495/pdf/

Abstract

Background: Large curated data sets are required to leverage speech-based tools in health care. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (ie, voiceprints), sharing recordings raises privacy concerns. This is especially relevant when working with patient data protected under the Health Insurance Portability and Accountability Act.

Objective: We aimed to determine the reidentification risk for speech recordings, without reference to demographics or metadata, in clinical data sets considering both the size of the search space (ie, the number of comparisons that must be considered when reidentifying) and the nature of the speech recording (ie, the type of speech task).

Methods: Using a state-of-the-art speaker identification model, we modeled an adversarial attack scenario in which an adversary uses a large data set of identified speech (hereafter, the known set) to reidentify as many unknown speakers in a shared data set (hereafter, the unknown set) as possible. We first considered the effect of search space size by attempting reidentification with various sizes of known and unknown sets using VoxCeleb, a data set with recordings of natural, connected speech from >7000 healthy speakers. We then repeated these tests with different types of recordings in each set to examine whether the nature of a speech recording influences reidentification risk. For these tests, we used our clinical data set composed of recordings of elicited speech tasks from 941 speakers.
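
The attack scenario described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the abstract does not say how the speaker identification model scores a pair of recordings, so cosine similarity between fixed-length embeddings is assumed here, and the names `cosine`, `count_acceptances`, the toy embeddings, and the threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring rule)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def count_acceptances(known, unknown, threshold):
    """Compare every unknown recording against every known recording.

    known, unknown: lists of (speaker_id, embedding) pairs.
    Returns (comparisons, true_acceptances, false_acceptances).
    """
    comparisons = true_acc = false_acc = 0
    for u_id, u_emb in unknown:
        for k_id, k_emb in known:
            comparisons += 1
            if cosine(u_emb, k_emb) >= threshold:
                if u_id == k_id:
                    true_acc += 1   # accepted match, same speaker
                else:
                    false_acc += 1  # accepted match, different speakers
    return comparisons, true_acc, false_acc


# Toy 2-dimensional embeddings, invented purely for illustration.
known = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [0.9, 0.1])]
unknown = [("A", [0.95, 0.05]), ("D", [0.1, 0.9])]

comparisons, true_acc, false_acc = count_acceptances(known, unknown, threshold=0.95)
print(comparisons, true_acc, false_acc)
```

In this sketch the search space is simply the number of known-unknown pairs, so it grows with the product of the two set sizes.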

Results: We found that the risk was inversely related to the number of comparisons an adversary must consider (ie, the search space), with a positive linear correlation between the number of false acceptances (FAs) and the number of comparisons (r=0.69; P<.001). The true acceptances (TAs) stayed relatively stable, and the ratio between FAs and TAs rose from 0.02 at 1 × 10^5 comparisons to 1.41 at 6 × 10^6 comparisons, with a near 1:1 ratio at the midpoint of 3 × 10^6 comparisons. In effect, risk was high for a small search space but dropped as the search space grew. We also found that the nature of a speech recording influenced reidentification risk, with nonconnected speech (eg, vowel prolongation: FA/TA=98.5; alternating motion rate: FA/TA=8) being harder to identify than connected speech (eg, sentence repetition: FA/TA=0.54) in cross-task conditions. The inverse was mostly true in within-task conditions, with the FA/TA ratio for vowel prolongation and alternating motion rate dropping to 0.39 and 1.17, respectively.
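
One way to read the FA/TA trend is as simple arithmetic. The sketch below assumes, beyond what the abstract states, that false acceptances occur at a roughly constant rate per comparison while true acceptances stay fixed. The function names and every number in the example are hypothetical and are not taken from the study.

```python
def expected_false_acceptances(n_known, n_unknown, comparisons_per_fa):
    """Return (comparisons, expected FAs), assuming one FA per
    `comparisons_per_fa` comparisons on average (illustrative assumption)."""
    comparisons = n_known * n_unknown
    return comparisons, comparisons / comparisons_per_fa


def fa_ta_ratio(false_acceptances, true_acceptances):
    """Ratio of false to true acceptances; values above 1 mean that most
    accepted matches are wrong."""
    return false_acceptances / true_acceptances


# Invented values: 20 true acceptances in both cases, one FA per 100,000 comparisons.
true_acceptances = 20
for n_known, n_unknown in [(1_000, 100), (6_000, 1_000)]:
    comparisons, fas = expected_false_acceptances(n_known, n_unknown, 100_000)
    print(comparisons, fas, fa_ta_ratio(fas, true_acceptances))
```

Under these assumptions the ratio grows in proportion to the search space, which matches the direction of the reported trend: the larger the search space, the less an adversary can trust any single accepted match.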

Conclusions: Our findings suggest that speaker identification models can be used to reidentify participants in specific circumstances, but in practice, the reidentification risk appears small. The variation in risk due to search space size and type of speech task provides actionable recommendations to further increase participant privacy and considerations for policy regarding public release of speech recordings.
