On Isotropy of Multimodal Embeddings

Inf. Comput. Pub Date: 2023-07-10. DOI: 10.3390/info14070392

Kirill Tyshchuk, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev, A. Panchenko

Citations: 1

Abstract

Embeddings, i.e., vector representations of objects such as texts, images, or graphs, play a key role in modern deep learning. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based text encoders such as BERT. Anisotropic word embeddings do not use the entire space; instead, they concentrate in a narrow cone of the pretrained vector space, which hurts the performance of applications such as textual semantic similarity. Transforming a vector space to improve isotropy has been shown to benefit text processing tasks. This paper is the first comprehensive investigation of the distribution of multimodal embeddings, using OpenAI’s CLIP pretrained model as an example. We aimed to deepen the understanding of the multimodal embedding space, which had not previously been explored in this respect, and to study its impact on various end tasks. Our initial efforts focused on measuring the alignment of image and text embedding distributions, with an emphasis on their isotropic properties. In addition, we evaluated several gradient-free approaches to enhance these properties, establishing their effectiveness in improving the isotropy/alignment of the embeddings and, in certain cases, the zero-shot classification accuracy. Notably, our analysis revealed that both CLIP and BERT models yield embeddings situated within a cone immediately after initialization, before any training. However, these embeddings were mostly isotropic in the local sense. We further extended our investigation to the structure of multilingual CLIP text embeddings, confirming that the observed characteristics are language-independent. By computing the few-shot classification accuracy and point-cloud metrics, we provide evidence of a strong correlation among multilingual embeddings. Transforming embeddings with the methods described in this article makes them easier to visualize. At the same time, our experiments showed that downstream task performance on the transformed embeddings does not drop substantially (and sometimes even improves). This means that one can obtain an easily visualizable embedding space without substantially sacrificing quality on downstream tasks.
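To make the notion of a "narrow cone" concrete, the following is a minimal sketch rather than the authors' method. It assumes mean pairwise cosine similarity as a rough anisotropy proxy and uses centering plus whitening as one example of a gradient-free transformation; the abstract does not say which metrics or transformations the paper actually uses. The data are synthetic stand-ins for embeddings, not CLIP outputs, and the function names are hypothetical.

```python
import numpy as np


def mean_cosine_similarity(x: np.ndarray) -> float:
    """Average cosine similarity over all ordered pairs of distinct rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    # Exclude the diagonal: self-similarity is always 1.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def center_and_whiten(x: np.ndarray) -> np.ndarray:
    """Subtract the mean, then rescale so the sample covariance is the identity."""
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Guard against division by near-zero eigenvalues.
    scale = 1.0 / np.sqrt(np.maximum(eigvals, 1e-12))
    # Rotate into the eigenbasis and scale each direction to unit variance.
    return centered @ eigvecs * scale


rng = np.random.default_rng(0)
# Synthetic "cone": a large shared offset plus small isotropic noise (assumption).
embeddings = rng.normal(size=(500, 16)) + 5.0

before = mean_cosine_similarity(embeddings)
whitened = center_and_whiten(embeddings)
after = mean_cosine_similarity(whitened)
print(f"mean cosine similarity before: {before:.3f}, after: {after:.3f}")
```

On this toy data the shared offset pushes all vectors in nearly the same direction, so the average cosine similarity is close to 1; after centering and whitening it falls close to 0. Whether such a transformation helps or harms a downstream task has to be checked empirically, as the abstract notes.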

