CoVR-2: Automatic Data Construction for Composed Video Retrieval

IEEE Transactions on Pattern Analysis and Machine Intelligence | Pub Date: 2024-09-20 | DOI: 10.1109/TPAMI.2024.3463799

Lucas Ventura;Antoine Yang;Cordelia Schmid;Gül Varol

Volume 46, Issue 12, pp. 11409-11421. Publication type: Journal Article. Source: https://ieeexplore.ieee.org/document/10685001/

Citations: 0

Abstract

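Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include Composed Video Retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.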
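The triplet-construction idea in the abstract (mine videos whose captions are similar, then generate a modification text for each pair) can be illustrated with a small sketch. This is not the authors' pipeline. It assumes that "similar caption" means two captions of equal length that differ in exactly one word, and it replaces the large language model with a hypothetical template function, `placeholder_modification`. All function names and the toy data are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Return (i, j, word_i, word_j) for caption pairs differing in exactly one word.

    Assumption: "similar caption" is approximated as same token count with a
    single differing token; the abstract does not define similarity.
    """
    tokens = [c.lower().split() for c in captions]
    buckets = defaultdict(list)
    # Mask one position at a time: captions that share a masked key can
    # differ only at that masked position.
    for idx, toks in enumerate(tokens):
        for pos in range(len(toks)):
            key = (len(toks), pos, tuple(toks[:pos] + toks[pos + 1:]))
            buckets[key].append(idx)
    pairs = []
    for (_, pos, _), members in buckets.items():
        for i, j in combinations(members, 2):
            # Skip identical captions, which collide at every position.
            if tokens[i][pos] != tokens[j][pos]:
                pairs.append((i, j, tokens[i][pos], tokens[j][pos]))
    return sorted(pairs)


def placeholder_modification(src_word, tgt_word):
    """Hypothetical stand-in for the LLM step, which the abstract does not specify."""
    return f"change {src_word} to {tgt_word}"


def build_triplets(video_caption_pairs):
    """Turn (video_id, caption) pairs into (query_video, modification_text, target_video)."""
    video_ids = [v for v, _ in video_caption_pairs]
    captions = [c for _, c in video_caption_pairs]
    triplets = []
    for i, j, w_i, w_j in mine_caption_pairs(captions):
        # Illustrative choice: emit one triplet per direction of each mined pair.
        triplets.append((video_ids[i], placeholder_modification(w_i, w_j), video_ids[j]))
        triplets.append((video_ids[j], placeholder_modification(w_j, w_i), video_ids[i]))
    return triplets


# Toy data, invented for this example.
data = [
    ("v1", "a dog running on the beach"),
    ("v2", "a cat running on the beach"),
    ("v3", "a dog running on the grass"),
    ("v4", "two people cooking dinner"),
]
triplets = build_triplets(data)
for query, text, target in triplets:
    print(query, "|", text, "|", target)
```

Bucketing by masked captions avoids comparing every caption against every other, which matters when the caption pool runs to millions of entries. In a real pipeline the template would be replaced by a call to a language model given both captions.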



