Evaluation of Vision-Language Models for Detection and Deidentification of Medical Images with Burned-In Protected Health Information
Taehee Lee, Hyungjin Kim, Seong Ho Park, Seonhye Chae, Soon Ho Yoon
Radiology. 2025;315(3):e243664. Published June 1, 2025. DOI: 10.1148/radiol.243664
Abstract
Background Advances in vision-language models (VLMs) may enable detection and deidentification of burned-in protected health information (PHI) on medical images. Purpose To investigate the ability of commercial and open-source VLMs to detect burned-in PHI on medical images, confirm full deidentification, and obscure PHI where present. Materials and Methods In this retrospective study, records of deceased patients aged 18 years or older who died during admission at a tertiary hospital between January and June 2021 were randomly selected. One study per modality was randomly selected. Images were preprocessed to ensure the presence of burned-in PHI and to test four deidentification scenarios: all PHI text visible, PHI text redacted using asterisks, PHI text removed, and all text removed. Real PHI was replaced with fictitious data to protect privacy. Four VLMs (three commercial: ChatGPT-4o [OpenAI], Gemini 1.5 Pro [Google], and Claude-3 Haiku [Anthropic]; one open-source: Llama 3.2 Vision 11B [Meta]) were tested on three tasks: task 1, overall confirmation of deidentification; task 2, detection and specification of any identifiable PHI items; and task 3, detection and specification of the five preselected PHI items (name, identification number, date of birth, age, and sex). Text was also extracted from images using the open-source Tesseract optical character recognition software and input into the VLMs for the same tasks. Additionally, the capability of each VLM to mask detected PHI fields was evaluated. Statistical comparisons were conducted using χ² tests, independent t tests, or generalized estimating equations. Results Data from 100 deceased patients (mean age, 71.1 years ± 10.1 [SD]; 57 male) with 709 imaging studies were randomly included.
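As a minimal illustration of the 2×2 χ² comparison named in Methods, the sketch below computes a plain Pearson χ² statistic (no continuity correction) for a model-versus-model accuracy table. The per-model correct counts for Gemini are approximated from the reported task 1 percentage, so they are illustrative, not the study's exact cell counts.

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]
    (rows = models, columns = correct/incorrect), without Yates correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Approximate task 1 counts out of 6696 PHI occurrences:
# ChatGPT-4o 95.0% -> 6362 correct; Gemini 1.5 Pro 68.1% -> ~4560 correct.
stat = chi2_2x2(6362, 6696 - 6362, 4560, 6696 - 4560)
# With 1 degree of freedom, a statistic above 10.83 corresponds to P < .001.
```

With these counts the statistic is in the thousands, consistent with the reported P < .001; the study additionally used independent t tests and generalized estimating equations where appropriate.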
Among 6696 PHI occurrences, ChatGPT-4o achieved deidentification verification accuracy of 95.0% (n = 6362) for task 1, 61.2% (n = 4098) for task 2, and 96.2% (n = 6441) for task 3, outperforming Gemini 1.5 Pro (68.1%, 55.2%, and 86.3% for tasks 1-3, respectively), Claude-3 Haiku (75.8%, 86.9%, and 79.4%), and Llama 3.2 Vision 11B (51.6%, 66.9%, and 74.3%) (P < .001 for all). Direct image analysis by ChatGPT-4o and Gemini 1.5 Pro was more accurate than the optical character recognition software for PHI detection across all three deidentification verification tasks (P < .001 for all). Among 375 PHI occurrences on 100 images, ChatGPT-4o successfully obscured 81.1% (n = 304). Conclusion ChatGPT-4o demonstrated substantial potential in detecting, verifying, and obscuring burned-in PHI on medical images. © RSNA, 2025. Supplemental material is available for this article. See also the editorial by Pinto dos Santos in this issue.
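The task 3 detection of the five preselected PHI items and the masking capability can be sketched with a rule-based analogue operating on OCR-extracted text. The field labels, formats, and regular expressions below are illustrative assumptions, not the study's actual burned-in text layouts or VLM prompts; in the study pipeline the input text would come from Tesseract OCR or from direct VLM image analysis.

```python
import re

# Patterns for the five preselected PHI items from task 3 (name,
# identification number, date of birth, age, sex). Labels and formats
# here are hypothetical, chosen only to make the sketch concrete.
PHI_PATTERNS = {
    "name": re.compile(r"Name:\s*[A-Za-z]+(?: [A-Za-z]+)*"),
    "id": re.compile(r"ID:\s*\d+"),
    "dob": re.compile(r"DOB:\s*\d{4}-\d{2}-\d{2}"),
    "age": re.compile(r"Age:\s*\d{1,3}"),
    "sex": re.compile(r"Sex:\s*(?:M|F|Male|Female)"),
}

def detect_phi(text: str) -> dict[str, list[str]]:
    """Return each PHI field found in the text (a task 3 analogue)."""
    return {k: p.findall(text) for k, p in PHI_PATTERNS.items() if p.search(text)}

def mask_phi(text: str) -> str:
    """Replace detected PHI spans with asterisks (a masking analogue)."""
    for pattern in PHI_PATTERNS.values():
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text

# In practice the text would be extracted from the image first, e.g.
# (pytesseract is one common wrapper around Tesseract, used here as an
# assumption, not the study's stated tooling):
#   import pytesseract, PIL.Image
#   text = pytesseract.image_to_string(PIL.Image.open("study.png"))
burned_in = "Name: JANE DOE  ID: 12345678  DOB: 1950-03-14  Age: 71  Sex: F"
found = detect_phi(burned_in)
redacted = mask_phi(burned_in)
```

A rule-based detector like this only handles anticipated labels and formats, which is precisely the limitation that motivates testing VLMs on free-form burned-in text.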