Proceedings of the 19th International Conference on Content-based Multimedia Indexing: Latest Publications

ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina, Matteo Stefanini, M. Cornia, L. Baraldi, F. Falchi, G. Amato, R. Cucchiara
DOI: https://doi.org/10.1145/3549555.3549576 · Published: 2022-07-29
Abstract: Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to fill in the gap between effectiveness and efficiency by proposing an ALign And DIstill Network (ALADIN). ALADIN first produces high-effective scores by aligning at fine-grained level images and texts. Then, it learns a shared embedding space – where an efficient kNN search can be performed – by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
Citations: 4
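A minimal sketch of the align-and-distill idea described in the abstract above: a lightweight dual-encoder student is trained so that its cosine similarities match a teacher's expensive fine-grained alignment scores, after which retrieval reduces to kNN search over cached embeddings. The encoders, the `teacher_alignment_scores` callable, and the KL-based distillation loss are assumptions for illustration, not the actual ALADIN components (see the paper and the linked repository for those).

```python
# Hypothetical sketch: distill fine-grained alignment scores into a shared
# embedding space where retrieval becomes an efficient kNN search.
import torch
import torch.nn.functional as F

def distillation_step(image_encoder, text_encoder, teacher_alignment_scores,
                      images, texts, temperature=0.07):
    # Student: one global embedding per image and per text (B x D each).
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)
    # Student relevance scores: cosine similarity for every image-text pair in the batch.
    student_scores = img_emb @ txt_emb.t() / temperature            # (B, B)
    # Teacher: expensive fine-grained alignment scores for the same pairs.
    with torch.no_grad():
        teacher_scores = teacher_alignment_scores(images, texts)    # (B, B)
    # Distill by matching the image-to-text and text-to-image score distributions.
    loss_i2t = F.kl_div(F.log_softmax(student_scores, dim=1),
                        F.softmax(teacher_scores, dim=1), reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(student_scores.t(), dim=1),
                        F.softmax(teacher_scores.t(), dim=1), reduction="batchmean")
    return loss_i2t + loss_t2i

# At retrieval time only the shared embeddings are needed, so a query text can be
# matched against millions of pre-computed image embeddings with plain kNN search.
```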
Retrieval-Augmented Transformer for Image Captioning
Sara Sarto, Marcella Cornia, L. Baraldi, R. Cucchiara
DOI: https://doi.org/10.1145/3549555.3549585 · Published: 2022-07-26
Abstract: Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on text retrieved from the external memory. Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.
Citations: 23
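A minimal sketch of a kNN-augmented attention step in the spirit of the abstract above: the decoder attends jointly over its past context and over token embeddings retrieved from an external memory. The tensor layout and the single attention head are assumptions for illustration, not the paper's actual layer; building the memory and retrieving neighbors by visual similarity is assumed to happen elsewhere.

```python
# Hypothetical sketch: one attention step that mixes past context with tokens
# retrieved from an external corpus (kNN memory).
import torch
import torch.nn.functional as F

def knn_augmented_attention(query, past_context, retrieved_memory):
    """query: (B, 1, D) embedding of the token being decoded
    past_context: (B, T, D) embeddings of the previously generated context
    retrieved_memory: (B, K, D) token embeddings fetched from the external memory
    """
    # Attend jointly over the past context and the retrieved neighbors;
    # for brevity the keys double as values in this single-head sketch.
    keys = torch.cat([past_context, retrieved_memory], dim=1)      # (B, T+K, D)
    scores = query @ keys.transpose(1, 2) / keys.size(-1) ** 0.5   # (B, 1, T+K)
    weights = F.softmax(scores, dim=-1)
    return weights @ keys                                          # (B, 1, D)
```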
Deep Features for CBIR with Scarce Data using Hebbian Learning
Gabriele Lagani, D. Bacciu, C. Gallicchio, F. Falchi, C. Gennaro, G. Amato
DOI: https://doi.org/10.1145/3549555.3549587 · Published: 2022-05-18
Abstract: Features extracted from Deep Neural Networks (DNNs) have proven to be very effective in the context of Content Based Image Retrieval (CBIR). Recently, biologically inspired Hebbian learning algorithms have shown promises for DNN training. In this contribution, we study the performance of such algorithms in the development of feature extractors for CBIR tasks. Specifically, we consider a semi-supervised learning strategy in two steps: first, an unsupervised pre-training stage is performed using Hebbian learning on the image dataset; second, the network is fine-tuned using supervised Stochastic Gradient Descent (SGD) training. For the unsupervised pre-training stage, we explore the nonlinear Hebbian Principal Component Analysis (HPCA) learning rule. For the supervised fine-tuning stage, we assume sample efficiency scenarios, in which the amount of labeled samples is just a small fraction of the whole dataset. Our experimental analysis, conducted on the CIFAR10 and CIFAR100 datasets, shows that, when few labeled samples are available, our Hebbian approach provides relevant improvements compared to various alternative methods.
Citations: 4
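A minimal sketch of a nonlinear Hebbian PCA (HPCA) weight update for one fully connected layer, in the spirit of the unsupervised pre-training stage described above: each neuron is pushed toward the part of the input not already reconstructed by itself and the preceding neurons. The ReLU nonlinearity, the learning rate, and the plain NumPy single-sample form are assumptions for illustration; the exact rule and layer types used in the paper may differ.

```python
# Hypothetical sketch of a nonlinear HPCA update: Hebbian term plus a
# decorrelating subtraction of the cumulative reconstruction up to each neuron.
import numpy as np

def hpca_update(W, x, lr=1e-3, f=lambda a: np.maximum(a, 0.0)):
    """W: (num_neurons, input_dim) weight matrix, x: (input_dim,) one input sample."""
    y = f(W @ x)                          # nonlinear post-synaptic activations
    dW = np.zeros_like(W)
    reconstruction = np.zeros_like(x)
    for i in range(W.shape[0]):
        reconstruction = reconstruction + y[i] * W[i]   # reconstruction from neurons 1..i
        dW[i] = lr * y[i] * (x - reconstruction)        # Hebbian update on the residual
    return W + dW
```

After this unsupervised stage has shaped the feature extractor, the network would be fine-tuned with standard supervised SGD on the small labeled subset, as the abstract describes.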
An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
L. Pham, D. Ngo, Phu X. Nguyen, Hoang Van Truong, Alexander Schindler
DOI: https://doi.org/10.1145/3549555.3549568 · Published: 2021-12-16
Abstract: In this paper, we present the task of audio-visual scene classification (SC) where input videos are classified into one of five real-life crowded scenes: ‘Riot’, ‘Noise-Street’, ‘Firework-Event’, ‘Music-Event’, and ‘Sport-Atmosphere’. To this end, we firstly collect an audio-visual dataset (videos) of these five crowded contexts from Youtube (in-the-wild scenes). Then, a wide range of deep learning classification models are proposed to train either audio or visual input data independently. Finally, results obtained from high-performance models are fused to achieve the best accuracy score. Our experimental results indicate that audio and visual input factors independently contribute to the SC task’s performance. Notably, an ensemble of deep learning models can achieve the best accuracy of 95.7%.
Citations: 5
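A minimal sketch of late fusion for the five crowded-scene classes named in the abstract above: the softmax outputs of independently trained audio and visual classifiers are combined by a weighted average before taking the argmax. The equal default weighting and the simple averaging rule are assumptions for illustration, not necessarily the ensemble used in the paper.

```python
# Hypothetical sketch: late fusion of per-modality predictions for the five
# crowded-scene classes by weighted probability averaging.
import numpy as np

CLASSES = ["Riot", "Noise-Street", "Firework-Event", "Music-Event", "Sport-Atmosphere"]

def fuse_predictions(audio_probs, visual_probs, audio_weight=0.5):
    """audio_probs, visual_probs: (num_clips, 5) softmax outputs of the two models."""
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs
    return [CLASSES[i] for i in fused.argmax(axis=1)]
```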
Proceedings of the 19th International Conference on Content-based Multimedia Indexing
DOI: https://doi.org/10.1145/3549555
Citations: 0