Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

arXiv - CS - Computers and Society. Pub Date: 2024-08-04. DOI: arxiv-2408.01959

Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe



Abstract

Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.
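The abstract does not spell out how model impressions are compared with human ones, so the following is only a minimal sketch of one plausible way to do it, not the authors' procedure. It assumes an image embedding is scored by its cosine similarity to a "positive" attribute prompt minus its similarity to a "negative" one, and that agreement with human ratings is then measured by rank correlation. The function names, the two-dimensional toy embeddings, and the human rating values are all hypothetical placeholders; real embeddings would come from a CLIP image and text encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def impression_score(image_emb, pos_text_emb, neg_text_emb):
    """Hypothetical bias score: similarity to the positive attribute
    prompt minus similarity to the negative attribute prompt."""
    return cosine(image_emb, pos_text_emb) - cosine(image_emb, neg_text_emb)

def rank(values):
    """0-based ranks of the values (assumes no ties; illustration only)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

# Toy stand-ins for encoder outputs (NOT real CLIP embeddings).
pos_prompt = np.array([1.0, 0.0])   # e.g. text embedding of a "trustworthy" prompt
neg_prompt = np.array([0.0, 1.0])   # e.g. text embedding of an "untrustworthy" prompt
face_embs = [
    np.array([1.0, 0.0]),
    np.array([1.0, 1.0]),
    np.array([0.0, 1.0]),
]

# Made-up mean human ratings for the same three faces (illustrative only).
human_ratings = [6.1, 4.0, 2.2]

scores = [impression_score(e, pos_prompt, neg_prompt) for e in face_embs]
rho = spearman(scores, human_ratings)
print("model scores:", scores)
print("rank correlation with human ratings:", rho)
```

Under these assumptions, a higher rank correlation for a given attribute would indicate that the model's ordering of faces is closer to the human ordering. Repeating the computation per attribute and per model is what would allow a comparison across dataset sizes.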

