{"title":"A Survey of Multimodal Composite Editing and Retrieval","authors":"Suyan Li, Fuxiang Huang, Lei Zhang","doi":"arxiv-2409.05405","DOIUrl":null,"url":null,"abstract":"In the real world, where information is abundant and diverse across different\nmodalities, understanding and utilizing various data types to improve retrieval\nsystems is a key focus of research. Multimodal composite retrieval integrates\ndiverse modalities such as text, image and audio, etc. to provide more\naccurate, personalized, and contextually relevant results. To facilitate a\ndeeper understanding of this promising direction, this survey explores\nmultimodal composite editing and retrieval in depth, covering image-text\ncomposite editing, image-text composite retrieval, and other multimodal\ncomposite retrieval. In this survey, we systematically organize the application\nscenarios, methods, benchmarks, experiments, and future directions. Multimodal\nlearning is a hot topic in large model era, and have also witnessed some\nsurveys in multimodal learning and vision-language models with transformers\npublished in the PAMI journal. To the best of our knowledge, this survey is the\nfirst comprehensive review of the literature on multimodal composite retrieval,\nwhich is a timely complement of multimodal fusion to existing reviews. To help\nreaders' quickly track this field, we build the project page for this survey,\nwhich can be found at\nhttps://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"62 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05405","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In the real world, where information is abundant and diverse across different
modalities, understanding and utilizing various data types to improve retrieval
systems is a key focus of research. Multimodal composite retrieval integrates
diverse modalities such as text, images, and audio to provide more accurate,
personalized, and contextually relevant results. To facilitate a
deeper understanding of this promising direction, this survey explores
multimodal composite editing and retrieval in depth, covering image-text
composite editing, image-text composite retrieval, and other multimodal
composite retrieval. In this survey, we systematically organize the application
scenarios, methods, benchmarks, experiments, and future directions. Multimodal
learning is a hot topic in the era of large models, and several surveys on
multimodal learning and vision-language models with transformers have been
published in the PAMI journal. To the best of our knowledge, this survey is the
first comprehensive review of the literature on multimodal composite retrieval,
and it offers a timely complement on multimodal fusion to existing reviews. To
help readers quickly track this field, we have built a project page for this survey,
which can be found at
https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.
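
To make the setting concrete, below is a minimal, illustrative sketch of image-text composite retrieval (a reference image plus a modification text used to query an image gallery). This is not a method from the survey; it assumes the Hugging Face transformers CLIP API, and the additive late fusion it uses is only a simple baseline standing in for the learned fusion modules the survey covers.

```python
# Illustrative sketch of image-text composite retrieval (assumption:
# Hugging Face `transformers` CLIP; additive fusion is a toy baseline,
# not a specific method from the survey).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(img: Image.Image) -> torch.Tensor:
    # Encode an image and L2-normalize so dot products are cosine similarities.
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

def embed_text(text: str) -> torch.Tensor:
    # Encode the modification text into the same embedding space.
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        feat = model.get_text_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

def composite_query(ref_img: Image.Image, mod_text: str) -> torch.Tensor:
    # Late fusion by normalized addition of the two query modalities.
    q = embed_image(ref_img) + embed_text(mod_text)
    return q / q.norm(dim=-1, keepdim=True)

def rank_gallery(query: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    # gallery: (N, d) matrix of precomputed, L2-normalized image embeddings.
    scores = gallery @ query.squeeze(0)     # cosine similarity per candidate
    return scores.argsort(descending=True)  # gallery indices, best match first
```

In a real system the gallery embeddings would be precomputed offline, and the additive combination above would be replaced by one of the learned composition functions discussed in the survey.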