Evaluation of Vision-Language Models for Detection and Deidentification of Medical Images with Burned-In Protected Health Information
Taehee Lee, Hyungjin Kim, Seong Ho Park, Seonhye Chae, Soon Ho Yoon
Radiology. 2025;315(3):e243664. Published June 1, 2025. DOI: 10.1148/radiol.243664
Abstract
Background Advances in vision-language models (VLMs) may enable detection and deidentification of burned-in protected health information (PHI) on medical images. Purpose To investigate the ability of commercial and open-source VLMs to detect burned-in PHI on medical images, confirm full deidentification, and obscure PHI where present. Materials and Methods In this retrospective study, records of deceased patients aged 18 years or older who died during admission at a tertiary hospital between January and June 2021 were randomly selected. One study per modality was randomly selected. Images were preprocessed to ensure the presence of burned-in PHI and to test four deidentification scenarios: all PHI text visible, PHI text redacted using asterisks, PHI text removed, and all text removed. Real PHI was replaced with fictitious data to protect privacy. Four VLMs (three commercial: ChatGPT-4o [OpenAI], Gemini 1.5 Pro [Google], and Claude-3 Haiku [Anthropic]; one open-source: Llama 3.2 Vision 11B [Meta]) were tested on three tasks: task 1, overall confirmation of deidentification; task 2, detection and specification of any identifiable PHI items; and task 3, detection and specification of the five preselected PHI items (name, identification number, date of birth, age, and sex). Text was also extracted from images using the open-source Tesseract optical character recognition software and input into the VLMs for the same tasks. Additionally, the capability of each VLM to mask detected PHI fields was evaluated. Statistical comparisons were conducted using χ² tests, independent t tests, or generalized estimating equations. Results Data from 100 deceased patients (mean age, 71.1 years ± 10.1 [SD]; 57 male) with 709 imaging studies were randomly included.
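As a minimal illustration of the 2×2 χ² comparison named in Methods, the sketch below computes a plain Pearson χ² statistic (no continuity correction) for a model-versus-model accuracy table. The per-model correct counts for Gemini are approximated from the reported task 1 percentage, so they are illustrative, not the study's exact cell counts.

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]
    (rows = models, columns = correct/incorrect), without Yates correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Approximate task 1 counts out of 6696 PHI occurrences:
# ChatGPT-4o 95.0% -> 6362 correct; Gemini 1.5 Pro 68.1% -> ~4560 correct.
stat = chi2_2x2(6362, 6696 - 6362, 4560, 6696 - 4560)
# With 1 degree of freedom, a statistic above 10.83 corresponds to P < .001.
```

With these counts the statistic is in the thousands, consistent with the reported P < .001; the study additionally used independent t tests and generalized estimating equations where appropriate.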
Among 6696 PHI occurrences, ChatGPT-4o achieved deidentification verification accuracy of 95.0% (n = 6362) for task 1, 61.2% (n = 4098) for task 2, and 96.2% (n = 6441) for task 3, outperforming Gemini 1.5 Pro (68.1%, 55.2%, and 86.3% for tasks 1-3, respectively), Claude-3 Haiku (75.8%, 86.9%, and 79.4%), and Llama 3.2 Vision 11B (51.6%, 66.9%, and 74.3%) (P < .001 for all). Direct image analysis by ChatGPT-4o and Gemini 1.5 Pro was more accurate than the optical character recognition software for PHI detection across all three deidentification verification tasks (P < .001 for all). Among 375 PHI occurrences on 100 images, ChatGPT-4o successfully obscured 81.1% (n = 304). Conclusion ChatGPT-4o demonstrated substantial potential in detecting, verifying, and obscuring burned-in PHI on medical images. © RSNA, 2025. Supplemental material is available for this article. See also the editorial by Pinto dos Santos in this issue.
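The task 3 detection of the five preselected PHI items and the masking capability can be sketched with a rule-based analogue operating on OCR-extracted text. The field labels, formats, and regular expressions below are illustrative assumptions, not the study's actual burned-in text layouts or VLM prompts; in the study pipeline the input text would come from Tesseract OCR or from direct VLM image analysis.

```python
import re

# Patterns for the five preselected PHI items from task 3 (name,
# identification number, date of birth, age, sex). Labels and formats
# here are hypothetical, chosen only to make the sketch concrete.
PHI_PATTERNS = {
    "name": re.compile(r"Name:\s*[A-Za-z]+(?: [A-Za-z]+)*"),
    "id": re.compile(r"ID:\s*\d+"),
    "dob": re.compile(r"DOB:\s*\d{4}-\d{2}-\d{2}"),
    "age": re.compile(r"Age:\s*\d{1,3}"),
    "sex": re.compile(r"Sex:\s*(?:M|F|Male|Female)"),
}

def detect_phi(text: str) -> dict[str, list[str]]:
    """Return each PHI field found in the text (a task 3 analogue)."""
    return {k: p.findall(text) for k, p in PHI_PATTERNS.items() if p.search(text)}

def mask_phi(text: str) -> str:
    """Replace detected PHI spans with asterisks (a masking analogue)."""
    for pattern in PHI_PATTERNS.values():
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text

# In practice the text would be extracted from the image first, e.g.
# (pytesseract is one common wrapper around Tesseract, used here as an
# assumption, not the study's stated tooling):
#   import pytesseract, PIL.Image
#   text = pytesseract.image_to_string(PIL.Image.open("study.png"))
burned_in = "Name: JANE DOE  ID: 12345678  DOB: 1950-03-14  Age: 71  Sex: F"
found = detect_phi(burned_in)
redacted = mask_phi(burned_in)
```

A rule-based detector like this only handles anticipated labels and formats, which is precisely the limitation that motivates testing VLMs on free-form burned-in text.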