ComAlign: Compositional Alignment in Vision-Language Models

Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah
{"title":"ComAlign: Compositional Alignment in Vision-Language Models","authors":"Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah","doi":"arxiv-2409.08206","DOIUrl":null,"url":null,"abstract":"Vision-language models (VLMs) like CLIP have showcased a remarkable ability\nto extract transferable features for downstream tasks. Nonetheless, the\ntraining process of these models is usually based on a coarse-grained\ncontrastive loss between the global embedding of images and texts which may\nlose the compositional structure of these modalities. Many recent studies have\nshown VLMs lack compositional understandings like attribute binding and\nidentifying object relationships. Although some recent methods have tried to\nachieve finer-level alignments, they either are not based on extracting\nmeaningful components of proper granularity or don't properly utilize the\nmodalities' correspondence (especially in image-text pairs with more\ningredients). Addressing these limitations, we introduce Compositional\nAlignment (ComAlign), a fine-grained approach to discover more exact\ncorrespondence of text and image components using only the weak supervision in\nthe form of image-text pairs. Our methodology emphasizes that the compositional\nstructure (including entities and relations) extracted from the text modality\nmust also be retained in the image modality. To enforce correspondence of\nfine-grained concepts in image and text modalities, we train a lightweight\nnetwork lying on top of existing visual and language encoders using a small\ndataset. The network is trained to align nodes and edges of the structure\nacross the modalities. Experimental results on various VLMs and datasets\ndemonstrate significant improvements in retrieval and compositional benchmarks,\naffirming the effectiveness of our plugin model.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08206","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Vision-language models (VLMs) like CLIP have showcased a remarkable ability to extract transferable features for downstream tasks. Nonetheless, the training process of these models is usually based on a coarse-grained contrastive loss between the global embeddings of images and texts, which may lose the compositional structure of these modalities. Many recent studies have shown that VLMs lack compositional understanding, such as attribute binding and identifying object relationships. Although some recent methods have tried to achieve finer-level alignment, they either do not extract meaningful components of the proper granularity or do not properly exploit the correspondence between the modalities (especially in image-text pairs with more constituents). Addressing these limitations, we introduce Compositional Alignment (ComAlign), a fine-grained approach to discovering more exact correspondences between text and image components using only the weak supervision available in the form of image-text pairs. Our methodology emphasizes that the compositional structure (including entities and relations) extracted from the text modality must also be retained in the image modality. To enforce the correspondence of fine-grained concepts across the image and text modalities, we train a lightweight network on top of existing visual and language encoders using a small dataset. The network is trained to align the nodes and edges of this structure across the modalities. Experimental results on various VLMs and datasets demonstrate significant improvements on retrieval and compositional benchmarks, affirming the effectiveness of our plug-in model.
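The abstract describes the method only at a high level: a lightweight head on top of frozen encoders that matches fine-grained components (nodes for entities, edges for relations) across modalities. The PyTorch sketch below illustrates one way such a component-level alignment head and contrastive objective could look; the class name, feature dimensions, max-matching scheme, and loss are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only (not the ComAlign code): a lightweight alignment
# head on top of frozen encoders, matching per-component text features
# (entities/relations) to per-region image features with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ComponentAlignmentHead(nn.Module):
    """Projects frozen per-component features from both modalities into a
    shared space and scores image-text pairs at the component level."""

    def __init__(self, text_dim: int = 512, image_dim: int = 768, proj_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.image_proj = nn.Linear(image_dim, proj_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style

    def forward(self, text_comps: torch.Tensor, image_comps: torch.Tensor) -> torch.Tensor:
        # text_comps:  (B, T, text_dim)  -- entity/relation features per caption
        # image_comps: (B, R, image_dim) -- region/patch features per image
        t = F.normalize(self.text_proj(text_comps), dim=-1)
        v = F.normalize(self.image_proj(image_comps), dim=-1)
        # Similarity of every text component to every image region, for every
        # text-image pair in the batch: (B, B, T, R)
        sim = torch.einsum("btd,crd->bctr", t, v)
        # Weak supervision: the true region for each component is unknown, so
        # score each component against its best-matching region, then average
        # over components to get a pairwise image-text score: (B, B)
        return sim.max(dim=-1).values.mean(dim=-1) * self.logit_scale.exp()


def fine_grained_contrastive_loss(scores: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over the component-level image-text scores."""
    targets = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(scores, targets) + F.cross_entropy(scores.t(), targets))


if __name__ == "__main__":
    head = ComponentAlignmentHead()
    text_comps = torch.randn(4, 6, 512)    # e.g. 6 extracted entities/relations per caption
    image_comps = torch.randn(4, 49, 768)  # e.g. 7x7 grid of patch features per image
    loss = fine_grained_contrastive_loss(head(text_comps, image_comps))
    print(loss.item())
```

Because only the small projection head is trained while the backbone encoders stay frozen, this kind of plug-in can be fit on a modest dataset, which is consistent with the abstract's claim of using a lightweight network and a small training set.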