ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Nicola Messina, Matteo Stefanini, M. Cornia, L. Baraldi, F. Falchi, G. Amato, R. Cucchiara
{"title":"为有效的图像-文本匹配和检索提取细粒度对齐分数","authors":"Nicola Messina, Matteo Stefanini, M. Cornia, L. Baraldi, F. Falchi, G. Amato, R. Cucchiara","doi":"10.1145/3549555.3549576","DOIUrl":null,"url":null,"abstract":"Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to fill in the gap between effectiveness and efficiency by proposing an ALign And DIstill Network (ALADIN). ALADIN first produces high-effective scores by aligning at fine-grained level images and texts. Then, it learns a shared embedding space – where an efficient kNN search can be performed – by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.","PeriodicalId":191591,"journal":{"name":"Proceedings of the 19th International Conference on Content-based Multimedia Indexing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval\",\"authors\":\"Nicola Messina, Matteo Stefanini, M. Cornia, L. Baraldi, F. Falchi, G. Amato, R. Cucchiara\",\"doi\":\"10.1145/3549555.3549576\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to fill in the gap between effectiveness and efficiency by proposing an ALign And DIstill Network (ALADIN). ALADIN first produces high-effective scores by aligning at fine-grained level images and texts. 
Then, it learns a shared embedding space – where an efficient kNN search can be performed – by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.\",\"PeriodicalId\":191591,\"journal\":{\"name\":\"Proceedings of the 19th International Conference on Content-based Multimedia Indexing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 19th International Conference on Content-based Multimedia Indexing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3549555.3549576\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th International Conference on Content-based Multimedia Indexing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3549555.3549576","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4

Abstract

Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists of finding images related to a given query text, or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods have proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to fill the gap between effectiveness and efficiency with an ALign And DIstill Network (ALADIN). ALADIN first produces highly effective scores by aligning images and texts at a fine-grained level. Then, it learns a shared embedding space, where an efficient kNN search can be performed, by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
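The core idea, a slow fine-grained teacher whose pairwise alignment scores supervise a fast shared-embedding student, can be sketched compactly. The snippet below is a minimal illustration, not the paper's exact formulation: the max-over-regions alignment heuristic, the KL-based distillation loss, the temperature `tau`, and all tensor shapes are assumptions made for this example.

```python
# Minimal sketch: distilling fine-grained alignment scores into a shared
# embedding space. Score function, loss, and shapes are illustrative choices,
# not the published ALADIN implementation.
import torch
import torch.nn.functional as F

def fine_grained_scores(img_regions, txt_tokens):
    """Teacher relevance for every (image, caption) pair in the batch.

    img_regions: (B, R, D) region features; txt_tokens: (B, T, D) token features.
    Each token is matched to its best region, then scores are averaged over
    tokens -- a common fine-grained alignment heuristic, assumed here.
    """
    img = F.normalize(img_regions, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    # sim[i, j, r, t]: cosine similarity of region r (image i) and token t (caption j)
    sim = torch.einsum('ird,jtd->ijrt', img, txt)
    return sim.max(dim=2).values.mean(dim=2)  # (B, B) pairwise scores

def distill_loss(img_emb, txt_emb, teacher_scores, tau=0.1):
    """KL distillation: push the student's in-batch similarity distribution
    toward the teacher's (an assumed loss, chosen for illustration)."""
    student = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T
    p_teacher = F.softmax(teacher_scores / tau, dim=1)
    log_p_student = F.log_softmax(student / tau, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean')

# Toy run with random features standing in for backbone outputs.
B, R, T, D = 8, 36, 12, 256
regions, tokens = torch.randn(B, R, D), torch.randn(B, T, D)
teacher = fine_grained_scores(regions, tokens)              # slow, accurate
img_emb, txt_emb = regions.mean(dim=1), tokens.mean(dim=1)  # crude pooled student
loss = distill_loss(img_emb, txt_emb, teacher)

# At retrieval time only the pooled embeddings are needed: kNN search is a
# top-k over one (queries x gallery) similarity matrix.
sims = F.normalize(txt_emb, dim=-1) @ F.normalize(img_emb, dim=-1).T
topk = sims.topk(k=5, dim=1).indices  # 5 nearest images per caption
print(loss.item(), topk.shape)
```

Once trained, the student never runs the expensive cross-modal alignment at query time: gallery embeddings are precomputed and retrieval reduces to a single matrix multiplication or an approximate kNN index lookup, which is what makes the shared-space search so much faster than scoring every pair with a full VL Transformer.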