Synthetic4Health: generating annotated synthetic clinical letters.

IF 3.2 Q1 HEALTH CARE SCIENCES & SERVICES

Frontiers in digital health Pub Date : 2025-05-30 eCollection Date: 2025-01-01 DOI:10.3389/fdgth.2025.1497130

Libo Ren, Samuel Belkadi, Lifeng Han, Warren Del-Pinto, Goran Nenadic

Volume 7, article 1497130 · Journal Article · PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12163008/pdf/


Abstract

Clinical letters contain sensitive information, limiting their use in model training, medical research, and education. This study aims to generate reliable, diverse, and de-identified synthetic clinical letters to support these tasks. We investigated multiple pre-trained language models for text masking and generation, focusing on Bio_ClinicalBERT, and applied different masking strategies. Evaluation included qualitative and quantitative assessments, downstream named entity recognition (NER) tasks, and clinically focused evaluations using BioGPT and GPT-3.5-turbo. The experiments show: (1) encoder-only models perform better than encoder-decoder models; (2) models trained on general corpora perform comparably to clinical-domain models if clinical entities are preserved; (3) preserving clinical entities and document structure aligns with the task objectives; (4) masking strategies have a noticeable impact on the quality of synthetic clinical letters: masking stopwords has a positive impact, while masking nouns or verbs has a negative effect; (5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; (6) contextual information has only a limited effect on the models' understanding, suggesting that synthetic letters can effectively substitute for real ones in downstream NER tasks; (7) although the model occasionally generates hallucinated content, this appears to have little effect on overall clinical performance. Unlike previous research, which primarily focuses on reconstructing original letters by training language models, this paper provides a foundational framework for generating diverse, de-identified clinical letters. It offers a direction for using the model to process real-world clinical letters, thereby helping to expand datasets in the clinical domain. Our code and trained models are available at https://github.com/HECTA-UoM/Synthetic4Health.
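The masking idea in finding (4) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_stopwords`, the tiny stopword list, the `ratio` parameter, and the example sentence are all assumptions made for illustration. The sketch masks stopwords while leaving a given set of clinical entities untouched. In a full pipeline, the masked text would then be passed to a masked language model (the abstract names Bio_ClinicalBERT) to fill in the blanks.

```python
import random

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller list).
STOPWORDS = {"the", "a", "an", "of", "to", "was", "is", "and", "with", "for", "in", "on"}


def mask_stopwords(tokens, preserved, ratio=0.5, seed=0, mask_token="[MASK]"):
    """Replace a fraction of stopword tokens with mask_token.

    Tokens whose lowercase form is in `preserved` (e.g. clinical entities)
    are never masked, even if they also appear in the stopword list.
    """
    rng = random.Random(seed)
    # Positions eligible for masking: stopwords that are not preserved.
    candidates = [
        i for i, tok in enumerate(tokens)
        if tok.lower() in STOPWORDS and tok.lower() not in preserved
    ]
    # Number of positions to mask, given the requested ratio.
    k = round(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, k))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


# Hypothetical example sentence and entity set.
letter = "The patient was admitted with pneumonia and started on amoxicillin".split()
entities = {"pneumonia", "amoxicillin"}

masked = mask_stopwords(letter, entities, ratio=1.0)
print(" ".join(masked))
```

With `ratio=1.0` every eligible stopword is masked, while the entity tokens and the remaining content words stay in place. Lower ratios mask a random subset, which is one simple way to trade off diversity against fidelity to the source letter.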


