Gender and Ethnicity Bias of Text-to-Image Generative Artificial Intelligence in Medical Imaging, Part 1: Preliminary Evaluation.

Impact factor: 1.3 | JCR quartile: Q4 | Category: Radiology, Nuclear Medicine & Medical Imaging

Journal of Nuclear Medicine Technology | Publication date: 2024-12-04 | DOI: 10.2967/jnmt.124.268332

Geoffrey Currie, Johnathan Hewis, Elizabeth Hawk, Eric Rohren

Pages: 356-359 | Publication type: Journal Article

Citations: 0

Abstract

Generative artificial intelligence (AI) text-to-image production could reinforce or amplify gender and ethnicity biases. Several text-to-image generative AI tools are used to produce images that represent the medical imaging professions. White male stereotyping and masculine cultures can dissuade women and ethnically diverse people from entering a profession.

Methods: In March 2024, DALL-E 3, Firefly 2, Stable Diffusion 2.1, and Midjourney 5.2 were used to generate a series of individual and group images of medical imaging professionals: radiologist, nuclear medicine physician, radiographer, and nuclear medicine technologist. Multiple iterations of images were generated using a variety of prompts. Collectively, 184 images containing 391 characters were produced for evaluation. All images were independently analyzed by 3 reviewers for apparent gender and skin tone.

Results: Across all individual and group characters (n = 391), 60.6% were male and 87.7% had a light skin tone. DALL-E 3 (65.6%), Midjourney 5.2 (76.7%), and Stable Diffusion 2.1 (56.2%) had a statistically significantly higher representation of men than Firefly 2 (42.9%) (P < 0.0001). With Firefly 2, 70.3% of characters had light skin tones, which was statistically significantly lower (P < 0.0001) than for Stable Diffusion 2.1 (84.8%), Midjourney 5.2 (100%), and DALL-E 3 (94.8%). Overall, image quality metrics were rated average or better in 87.2% for DALL-E 3 and 86.2% for Midjourney 5.2, whereas they were rated inadequate or poor in 50.9% for Firefly 2 and 86.0% for Stable Diffusion 2.1.

Conclusion: Generative AI text-to-image generation using DALL-E 3 via GPT-4 has the best overall quality compared with Firefly 2, Midjourney 5.2, and Stable Diffusion 2.1. Nonetheless, DALL-E 3 includes inherent biases associated with gender and ethnicity that demand more critical evaluation.
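To put the pooled percentages in concrete terms, the short Python sketch below converts the two overall figures from the abstract (60.6% male, 87.7% light skin tone, n = 391) into approximate character counts. The counts are recovered from rounded percentages, so they are estimates and not figures reported by the authors. The Wilson score interval is an added illustration of sampling uncertainty and is an assumption of this sketch; the abstract does not state which interval or test the authors used. The helper names `approx_count` and `wilson_interval` are hypothetical.

```python
import math

N_CHARACTERS = 391  # total characters evaluated, as stated in the abstract


def approx_count(percent, n=N_CHARACTERS):
    """Convert a reported percentage into the nearest whole-character count.

    The result is an estimate: the abstract gives percentages rounded to
    one decimal place, not the underlying counts.
    """
    return round(percent / 100 * n)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage. This interval is an
    assumption of the sketch, not a method taken from the paper.
    """
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


male = approx_count(60.6)    # pooled share of male characters
light = approx_count(87.7)   # pooled share of light-skin-tone characters

for label, count in (("male", male), ("light skin tone", light)):
    low, high = wilson_interval(count, N_CHARACTERS)
    print(f"{label}: ~{count} of {N_CHARACTERS} "
          f"(approx. 95% interval {low:.3f} to {high:.3f})")
```

The same conversion cannot be applied to the per-tool percentages, because the abstract does not report how many characters each tool contributed.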


Source journal

Journal of Nuclear Medicine Technology (Radiology, Nuclear Medicine & Medical Imaging)

CiteScore: 1.90

Self-citation rate: 15.40%

Articles published: