UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Annual Meeting of the Association for Computational Linguistics. Pub Date: 2023-07-03. DOI: 10.48550/arXiv.2307.00862

Rui Sun, Zhecan Wang, Haoxuan You, N. Codella, Kai-Wei Chang, Shih-Fu Chang



Abstract

Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require a model to reason over the semantics of both the visual world and natural language. Supervised methods for these tasks have been well studied, but solving them in a zero-shot setting is less explored. Because Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous work has exploited this ability by converting vision-language tasks into an image-text matching problem, mainly considering global-level matching (e.g., the whole image or sentence). However, we find that fine-grained visual and textual information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that the framework outperforms previous zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of the proposed method. Code will be available at https://github.com/ThreeSR/UniFine
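To make the contrast between global-level and fine-grained matching concrete, the following is a minimal sketch and not the UniFine implementation. It assumes embeddings are already available (in practice they would come from a model such as CLIP; here they are hand-written toy vectors), and the function names, the max-over-regions aggregation, and the `weight` parameter are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def global_score(image_vec, text_vec):
    """Global-level matching: whole image against whole sentence."""
    return cosine(image_vec, text_vec)


def fine_grained_score(region_vecs, keyword_vecs):
    """Fine-grained matching (assumed aggregation): for each keyword take
    its best-matching image region, then average over keywords."""
    if not region_vecs or not keyword_vecs:
        return 0.0
    best = [max(cosine(r, k) for r in region_vecs) for k in keyword_vecs]
    return sum(best) / len(best)


def rank_candidates(image_vec, region_vecs, candidates, weight=0.5):
    """Rank candidate texts by a weighted sum of global and fine-grained
    scores. `weight` is a hypothetical mixing parameter."""
    scored = []
    for cand in candidates:
        score = (1 - weight) * global_score(image_vec, cand["text_vec"])
        score += weight * fine_grained_score(region_vecs, cand["keyword_vecs"])
        scored.append((cand["text"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: both candidates have the same global similarity to the image,
# so only the fine-grained term separates them.
image_vec = [1.0, 1.0, 0.0]
region_vecs = [[1.0, 0.0, 0.0]]  # one detected object region
candidates = [
    {"text": "a dog on the grass", "text_vec": [1.0, 0.0, 0.0],
     "keyword_vecs": [[1.0, 0.0, 0.0]]},
    {"text": "a cat on the sofa", "text_vec": [0.0, 1.0, 0.0],
     "keyword_vecs": [[0.0, 1.0, 0.0]]},
]

ranked = rank_candidates(image_vec, region_vecs, candidates)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

In this toy setup the two candidates tie on the global score, and the keyword-to-region term breaks the tie, which is the intuition the abstract gives for why fine-grained information can help.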

