Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI
Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe
arXiv - CS - Computers and Society, 2024-08-04, arXiv:2408.01959
Abstract
Multimodal AI models capable of associating images and text hold promise for
numerous domains, ranging from automated image captioning to accessibility
applications for blind and low-vision users. However, uncertainty about bias
has in some cases limited their adoption and availability. In the present work,
we study 43 CLIP vision-language models to determine whether they learn
human-like facial impression biases, and we find evidence that such biases are
reflected across three distinct CLIP model families. We show for the first time
that the degree to which a bias is shared across a society predicts the
degree to which it is reflected in a CLIP model. Human-like impressions of
visually unobservable attributes, like trustworthiness and sexuality, emerge
only in models trained on the largest dataset, indicating that a better fit to
uncurated cultural data results in the reproduction of increasingly subtle
social biases. Moreover, we use a hierarchical clustering approach to show that
dataset size predicts the extent to which the underlying structure of facial
impression bias resembles that of facial impression bias in humans. Finally, we
show that Stable Diffusion models employing CLIP as a text encoder learn facial
impression biases, and that these biases intersect with racial biases in Stable
Diffusion XL-Turbo. While pretrained CLIP models may prove useful for
scientific studies of bias, they will also require significant dataset curation
when intended for use as general-purpose models in a zero-shot setting.
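
To make the comparison between model associations and human impressions concrete, here is a minimal sketch of one way such a measurement could be set up. It is an illustration under stated assumptions, not the authors' procedure: the embeddings are toy 2-D vectors standing in for real CLIP image and text embeddings, the human ratings are invented, and the helper names (`cosine`, `attribute_score`, `pearson`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute_score(image_emb, pos_text_emb, neg_text_emb):
    """Relative association of an image with a positive vs. a negative prompt.

    Assumption: the difference of cosine similarities is used as the
    model-side "impression" of the face for one attribute.
    """
    return cosine(image_emb, pos_text_emb) - cosine(image_emb, neg_text_emb)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy stand-ins for embeddings of prompts such as "a trustworthy person"
# and "an untrustworthy person" (hypothetical prompts, not from the paper).
pos_text = [1.0, 0.0]
neg_text = [0.0, 1.0]

# Toy stand-ins for face-image embeddings.
image_embs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Invented mean human ratings for the same three faces.
human_ratings = [5.0, 3.0, 1.0]

model_scores = [attribute_score(e, pos_text, neg_text) for e in image_embs]
similarity_to_humans = pearson(model_scores, human_ratings)
print(model_scores, similarity_to_humans)
```

In this sketch, a correlation near 1 would mean the model's attribute scores rank faces the way the human raters do. With real models, the toy vectors would be replaced by embeddings produced by a CLIP image encoder and text encoder.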