Multi-scale Multi-modal Dictionary BERT For Effective Text-image Retrieval in Multimedia Advertising

Tan Yu, Jie Liu, Zhipeng Jin, Yi Yang, Hongliang Fei, Ping Li
DOI: 10.1145/3511808.3557653
Published in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
Publication date: 2022-10-17
Citations: 2

Abstract

Visual content in multimedia advertising effectively attracts customers' attention. Search-based multimedia advertising is a cross-modal retrieval problem. Due to the modality gap between texts and images/videos, cross-modal image/video retrieval is challenging. Recently, multi-modal dictionary BERT has bridged this modality gap by unifying images/videos and texts from different modalities through a multi-modal dictionary. In this work, we improve multi-modal dictionary BERT by developing a multi-scale multi-modal dictionary, and propose a Multi-scale Multi-modal Dictionary BERT (M^2D-BERT). The multi-scale dictionary partitions the feature space into different levels and is effective in describing both the fine-level and the coarse-level relevance between texts and images. Meanwhile, we constrain the code-words in dictionaries of different scales to be orthogonal to each other, which ensures that the multiple dictionaries are complementary. Moreover, we adopt a two-level residual quantization to enhance the capacity of each multi-modal dictionary. Systematic experiments conducted on large-scale cross-modal retrieval datasets demonstrate the excellent performance of our M^2D-BERT.
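The abstract names two mechanisms: two-level residual quantization (quantize a feature, then quantize the leftover residual with a second codebook) and an orthogonality constraint between codebooks of different scales. The paper's implementation is not shown here; the following is only a minimal NumPy sketch of those two generic ideas, with all function names, sizes, and the random codebooks being illustrative assumptions rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_codeword(x, codebook):
    """Return the index and vector of the codeword closest to x."""
    dists = np.linalg.norm(codebook - x, axis=1)  # distance to every codeword
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

def two_level_residual_quantize(x, cb1, cb2):
    """Quantize x with cb1, then quantize the residual x - q1 with cb2.

    The reconstruction q1 + q2 is finer than either codebook alone,
    which is the capacity gain the abstract attributes to this scheme.
    """
    i1, q1 = nearest_codeword(x, cb1)
    i2, q2 = nearest_codeword(x - q1, cb2)
    return (i1, i2), q1 + q2

def orthogonality_penalty(cb_fine, cb_coarse):
    """Sum of squared inner products between codewords of two scales.

    Driving this toward zero during training pushes the two
    dictionaries toward mutual orthogonality (complementarity).
    """
    gram = cb_fine @ cb_coarse.T
    return float(np.sum(gram ** 2))

# Toy setup: 8-dim features, 16 codewords per (hypothetical) codebook.
d, k = 8, 16
cb1 = rng.standard_normal((k, d))
cb2 = rng.standard_normal((k, d))
x = rng.standard_normal(d)

codes, x_hat = two_level_residual_quantize(x, cb1, cb2)
penalty = orthogonality_penalty(cb1, cb2)
```

In a trained system the penalty term would be added to the retrieval loss so the scales specialize; here it is only evaluated on random codebooks to show the computation.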