Controllable Synthetic Clinical Note Generation with Privacy Guarantees

arXiv - CS - Computation and Language | Pub Date: 2024-09-12 | DOI: arxiv-2409.07809

Tal Baumel, Andre Manoel, Daniel Jones, Shize Su, Huseyin Inan, Aaron (Ari) Bornstein, Robert Sim


Citations: 0

Abstract

In machine learning, domain-specific annotated data is an invaluable resource for training effective models. In the medical domain, however, this data often includes Personal Health Information (PHI), raising significant privacy concerns. The stringent regulations surrounding PHI limit the availability and sharing of medical datasets, which poses a substantial challenge for researchers and practitioners aiming to develop advanced machine learning models. In this paper, we introduce a novel method to "clone" datasets containing PHI. Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. By leveraging differential-privacy techniques and a novel fine-tuning task, our method produces datasets that are free from identifiable information while preserving the statistical properties necessary for model training. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets. The results demonstrate that our cloned datasets not only uphold privacy standards but also yield better model performance than traditional anonymized datasets. This work offers a viable solution for the ethical and effective utilization of sensitive medical data in machine learning, facilitating progress in medical research and the development of robust predictive models.
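The abstract names differential-privacy techniques without saying which mechanism is used. As a rough illustration only, the sketch below shows one generic DP-SGD-style aggregation step: clip each per-example gradient, sum, add Gaussian noise, and average. This is an assumption about the general family of techniques, not the authors' method; the function names and the parameters `max_norm` and `noise_multiplier` are hypothetical.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradients(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    Hypothetical helper for illustration; not taken from the paper.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise scale is tied to the clipping bound, which caps how much
    # any single example can change the sum.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Toy per-example gradients (made-up numbers, two parameters each).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
update = dp_average_gradients(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds each record's influence on the update, and the added noise masks what remains of it. Turning a given noise level into a formal privacy budget requires a separate accounting step that this sketch omits.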
