{"title":"Adversarial Attacks to Multi-Modal Models","authors":"Zhihao Dou, Xin Hu, Haibo Yang, Zhuqing Liu, Minghong Fang","doi":"arxiv-2409.06793","DOIUrl":null,"url":null,"abstract":"Multi-modal models have gained significant attention due to their powerful\ncapabilities. These models effectively align embeddings across diverse data\nmodalities, showcasing superior performance in downstream tasks compared to\ntheir unimodal counterparts. Recent study showed that the attacker can\nmanipulate an image or audio file by altering it in such a way that its\nembedding matches that of an attacker-chosen targeted input, thereby deceiving\ndownstream models. However, this method often underperforms due to inherent\ndisparities in data from different modalities. In this paper, we introduce\nCrossFire, an innovative approach to attack multi-modal models. CrossFire\nbegins by transforming the targeted input chosen by the attacker into a format\nthat matches the modality of the original image or audio file. We then\nformulate our attack as an optimization problem, aiming to minimize the angular\ndeviation between the embeddings of the transformed input and the modified\nimage or audio file. Solving this problem determines the perturbations to be\nadded to the original media. Our extensive experiments on six real-world\nbenchmark datasets reveal that CrossFire can significantly manipulate\ndownstream tasks, surpassing existing attacks. Additionally, we evaluate six\ndefensive strategies against CrossFire, finding that current defenses are\ninsufficient to counteract our CrossFire.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"44 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06793","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multi-modal models have gained significant attention due to their powerful capabilities. These models effectively align embeddings across diverse data modalities, showing superior performance in downstream tasks compared to their unimodal counterparts. A recent study showed that an attacker can manipulate an image or audio file so that its embedding matches that of an attacker-chosen target input, thereby deceiving downstream models. However, this method often underperforms due to inherent disparities between data from different modalities. In this paper, we introduce CrossFire, a new approach for attacking multi-modal models. CrossFire begins by transforming the attacker-chosen target input into a format that matches the modality of the original image or audio file. We then formulate the attack as an optimization problem that minimizes the angular deviation between the embedding of the transformed input and that of the modified image or audio file; solving this problem determines the perturbation to add to the original media.
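
The abstract does not spell out the optimization details, but the stated objective (minimize the angle between the adversarial embedding and the target embedding, under a small perturbation) can be sketched as below. This is a minimal, hypothetical PyTorch illustration, not the paper's implementation: the embed encoder, the L-infinity budget eps, the step count, and the Adam optimizer are all assumptions, and x_target_same_modality stands for the target input after it has already been converted into the same modality as the original media (step one of the attack, which is not shown).

    import torch
    import torch.nn.functional as F

    def crossfire_sketch(x_orig, x_target_same_modality, embed,
                         eps=8 / 255, steps=500, lr=1e-2):
        # Fixed embedding of the target after it was transformed into the
        # same modality as x_orig (the attack's first step, assumed done).
        with torch.no_grad():
            e_target = embed(x_target_same_modality)

        # Perturbation to optimize; the L-infinity budget eps is an assumption.
        delta = torch.zeros_like(x_orig, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)

        for _ in range(steps):
            e_adv = embed(torch.clamp(x_orig + delta, 0.0, 1.0))
            # Angular deviation between embeddings; minimizing it is the same
            # as maximizing cosine similarity. (A plain embedding-matching
            # attack would instead minimize ||e_adv - e_target|| here.)
            cos = F.cosine_similarity(e_adv, e_target, dim=-1)
            loss = torch.arccos(cos.clamp(-1 + 1e-6, 1 - 1e-6)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)  # keep the perturbation small

        return torch.clamp(x_orig + delta, 0.0, 1.0).detach()

One plausible reason for the angular objective, as opposed to the Euclidean distance used in direct embedding matching, is that the angle is invariant to embedding magnitude, which can differ systematically across modalities.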
Our extensive experiments on six real-world benchmark datasets show that CrossFire can significantly manipulate downstream tasks, surpassing existing attacks. In addition, we evaluate six defensive strategies against CrossFire and find that current defenses are insufficient to counteract it.