Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Pub Date: 2023-06-01. DOI: 10.1109/CVPR52729.2023.01065

Dmytro Kotovenko, Pingchuan Ma, Timo Milbich, B. Ommer



Abstract

Learning compact image embeddings that yield semantic similarities between images and that generalize to unseen test classes is at the core of deep metric learning (DML). Finding a mapping from a rich, localized image feature map onto a compact embedding vector is challenging: although similarity emerges between tuples of images, DML approaches marginalize out information in an individual image before considering another image to which similarity is to be computed. Instead, we propose during training to condition the embedding of an image on the image we want to compare it to. Rather than embedding by a simple pooling as in standard DML, we use cross-attention so that one image can identify relevant features in the other image. Consequently, the attention mechanism establishes a hierarchy of conditional embeddings that gradually incorporates information about the tuple to steer the representation of an individual image. The cross-attention layers bridge the gap between the original unconditional embedding and the final similarity and allow backpropagation to update encodings more directly than through a lossy pooling layer. At test time we use the resulting improved unconditional embeddings, thus requiring no additional parameters or computational overhead. Experiments on established DML benchmarks show that our cross-attention conditional embedding during training improves the underlying standard DML pipeline significantly so that it outperforms the state-of-the-art.
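To make the contrast between pooled and conditioned embeddings concrete, the following is a minimal sketch in plain Python. It is not the authors' architecture: it assumes a single parameter-free attention step with a residual connection, no learned projections, and mean pooling, and all function names (`cross_attend`, `unconditional_embedding`, `conditional_embedding`) are hypothetical. Each image is represented as a list of local feature vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def mean_pool(feats):
    # Collapse the local feature vectors into one vector by averaging.
    return [sum(col) / len(feats) for col in zip(*feats)]

def cross_attend(feats, other_feats):
    # Assumption: one attention step without learned weights. Every local
    # feature of this image attends over the other image's local features,
    # and the attended vector is added back as a residual.
    scale = 1.0 / math.sqrt(len(feats[0]))
    out = []
    for q in feats:
        weights = softmax([scale * dot(q, k) for k in other_feats])
        attended = [sum(w * k[d] for w, k in zip(weights, other_feats))
                    for d in range(len(q))]
        out.append([a + b for a, b in zip(q, attended)])
    return out

def unconditional_embedding(feats):
    # Standard DML path: pool one image without looking at any other image.
    return l2_normalize(mean_pool(feats))

def conditional_embedding(feats, other_feats):
    # Training-time path in this sketch: condition on the paired image first.
    return l2_normalize(mean_pool(cross_attend(feats, other_feats)))

# Toy 2-D local features for two images (made-up values).
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[1.0, 0.0], [1.0, 0.0]]

emb_a = unconditional_embedding(feats_a)
emb_a_given_b = conditional_embedding(feats_a, feats_b)
```

In this toy case the conditioned embedding of image A shifts toward the direction that image B emphasizes, while `unconditional_embedding` alone is what would be used at test time, in line with the abstract's claim of no extra test-time parameters or cost.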

