ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina, Matteo Stefanini, M. Cornia, L. Baraldi, F. Falchi, G. Amato, R. Cucchiara
In: Proceedings of the 19th International Conference on Content-based Multimedia Indexing. DOI: 10.1145/3549555.3549576

Abstract: Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists of finding images related to a given query text, or vice versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods have proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we fill the gap between effectiveness and efficiency with an ALign And DIstill Network (ALADIN). ALADIN first produces highly effective scores by aligning images and texts at a fine-grained level. It then learns a shared embedding space, in which an efficient kNN search can be performed, by distilling the relevance scores obtained from the fine-grained alignments. We obtain remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
Retrieval-Augmented Transformer for Image Captioning
Sara Sarto, Marcella Cornia, L. Baraldi, R. Cucchiara
In: Proceedings of the 19th International Conference on Content-based Multimedia Indexing. DOI: 10.1145/3549555.3549585

Abstract: Image captioning models aim to connect vision and language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models, by advancing visual feature extraction, or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on the text retrieved from the external memory. Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at a larger scale.
Deep Features for CBIR with Scarce Data using Hebbian Learning
Gabriele Lagani, D. Bacciu, C. Gallicchio, F. Falchi, C. Gennaro, G. Amato
In: Proceedings of the 19th International Conference on Content-based Multimedia Indexing. DOI: 10.1145/3549555.3549587

Abstract: Features extracted from Deep Neural Networks (DNNs) have proven to be very effective in the context of Content-Based Image Retrieval (CBIR). Recently, biologically inspired Hebbian learning algorithms have shown promise for DNN training. In this contribution, we study the performance of such algorithms in the development of feature extractors for CBIR tasks. Specifically, we consider a semi-supervised learning strategy in two steps: first, an unsupervised pre-training stage is performed using Hebbian learning on the image dataset; second, the network is fine-tuned using supervised Stochastic Gradient Descent (SGD) training. For the unsupervised pre-training stage, we explore the nonlinear Hebbian Principal Component Analysis (HPCA) learning rule. For the supervised fine-tuning stage, we consider sample-efficiency scenarios in which the number of labeled samples is only a small fraction of the whole dataset. Our experimental analysis, conducted on the CIFAR-10 and CIFAR-100 datasets, shows that, when few labeled samples are available, our Hebbian approach provides relevant improvements over various alternative methods.
An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
L. Pham, D. Ngo, Phu X. Nguyen, Hoang Van Truong, Alexander Schindler
In: Proceedings of the 19th International Conference on Content-based Multimedia Indexing. DOI: 10.1145/3549555.3549568

Abstract: In this paper, we present the task of audio-visual scene classification (SC), in which input videos are classified into one of five real-life crowded scenes: ‘Riot’, ‘Noise-Street’, ‘Firework-Event’, ‘Music-Event’, and ‘Sport-Atmosphere’. To this end, we first collect an audio-visual dataset of videos of these five crowded contexts from YouTube (in-the-wild scenes). Then, a wide range of deep learning classification models is proposed, each trained independently on either the audio or the visual input data. Finally, the results obtained from the best-performing models are fused to achieve the highest accuracy score. Our experimental results indicate that the audio and visual input factors independently contribute to the SC task’s performance. Notably, an ensemble of deep learning models achieves the best accuracy of 95.7%.
{"title":"Proceedings of the 19th International Conference on Content-based Multimedia Indexing","authors":"","doi":"10.1145/3549555","DOIUrl":"https://doi.org/10.1145/3549555","url":null,"abstract":"","PeriodicalId":191591,"journal":{"name":"Proceedings of the 19th International Conference on Content-based Multimedia Indexing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116822803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}