Relevance-based Margin for Contrastively-trained Video Retrieval Models

Proceedings of the 2022 International Conference on Multimedia Retrieval. Pub Date: 2022-04-27. DOI: 10.1145/3512527.3531395

Alex Falcon, Swathikiran Sudhakaran, G. Serra, Sergio Escalera, O. Lanz


Citations: 4

Abstract

Video retrieval using natural language queries has attracted increasing interest due to its relevance in real-world applications, from intelligent access in private media galleries to web-scale video search. Learning the cross-similarity of video and text in a joint embedding space is the dominant approach. To do so, a contrastive loss is usually employed because it organizes the embedding space by placing similar items close together and dissimilar items far apart. This framework leads to competitive recall rates, since recall focuses solely on the rank of the ground-truth items. Yet assessing the quality of the whole ranking list is of utmost importance for intelligent retrieval systems, since multiple items may share similar semantics and hence be highly relevant. Moreover, the aforementioned framework uses a fixed margin to separate similar and dissimilar items, treating all non-ground-truth items as equally irrelevant. In this paper we propose to use a variable margin: we argue that varying the margin used during training based on how relevant an item is to a given query, i.e. a relevance-based margin, easily improves the quality of the ranking lists as measured by nDCG and mAP. We demonstrate the advantages of our technique using different models on EPIC-Kitchens-100 and YouCook2. We show that even if the fixed margin were carefully tuned, our technique (which does not have the margin as a hyper-parameter) would still achieve better performance. Finally, extensive ablation studies and qualitative analysis support the robustness of our approach. Code will be released at https://github.com/aranciokov/RelevanceMargin-ICMR22.
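The contrast between a fixed margin and a relevance-based margin can be illustrated with a minimal sketch. This is not the authors' implementation: the hinge-style triplet term, the mapping margin = 1 - relevance (with relevance assumed to lie in [0, 1]), the fixed margin value of 0.2, and all function names are assumptions chosen for illustration only.

```python
from typing import Optional, Sequence


def triplet_term(sim_pos: float, sim_neg: float, margin: float) -> float:
    """Hinge-style triplet term: max(0, margin + s(q, neg) - s(q, pos))."""
    return max(0.0, margin + sim_neg - sim_pos)


def relevance_margin(relevance: float) -> float:
    """Assumed mapping: the margin shrinks as relevance (in [0, 1]) grows."""
    return 1.0 - relevance


def query_loss(
    sim_pos: float,
    neg_sims: Sequence[float],
    neg_relevances: Optional[Sequence[float]] = None,
    fixed_margin: float = 0.2,  # hypothetical value, not taken from the paper
) -> float:
    """Sum the triplet terms of one query over its non-ground-truth items.

    If neg_relevances is None, every item uses the same fixed margin.
    Otherwise each item gets its own margin derived from its relevance.
    """
    total = 0.0
    for k, sim_neg in enumerate(neg_sims):
        if neg_relevances is None:
            margin = fixed_margin
        else:
            margin = relevance_margin(neg_relevances[k])
        total += triplet_term(sim_pos, sim_neg, margin)
    return total


# One query, one ground-truth item (similarity 0.9) and two other items:
# the first is semantically close to the query, the second is unrelated.
fixed = query_loss(0.9, [0.8, 0.1])
variable = query_loss(0.9, [0.8, 0.1], neg_relevances=[0.75, 0.0])
print(fixed, variable)
```

With a fixed margin, both non-ground-truth items are pushed away by the same amount regardless of their semantics. With the relevance-derived margin, the highly relevant item only needs a small separation from the ground truth, while the irrelevant one is required to sit much farther away.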

