Multi-scale Multi-modal Dictionary BERT For Effective Text-image Retrieval in Multimedia Advertising

Tan Yu, Jie Liu, Zhipeng Jin, Yi Yang, Hongliang Fei, Ping Li
DOI: 10.1145/3511808.3557653
Published in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
Publication date: 2022-10-17
Citations: 2

Abstract

Visual content in multimedia advertising effectively attracts customers' attention. Search-based multimedia advertising is a cross-modal retrieval problem. Due to the modality gap between texts and images/videos, cross-modal image/video retrieval is challenging. Recently, multi-modal dictionary BERT has bridged this modality gap by unifying images/videos and texts from different modalities through a multi-modal dictionary. In this work, we improve multi-modal dictionary BERT by developing a multi-scale multi-modal dictionary, and propose a Multi-scale Multi-modal Dictionary BERT (M^2D-BERT). The multi-scale dictionary partitions the feature space into different levels and is effective in describing both the fine-level and the coarse-level relevance between texts and images. Meanwhile, we constrain the code-words in dictionaries of different scales to be orthogonal to each other, which ensures that the multiple dictionaries are complementary. Moreover, we adopt a two-level residual quantization to enhance the capacity of each multi-modal dictionary. Systematic experiments conducted on large-scale cross-modal retrieval datasets demonstrate the excellent performance of our M^2D-BERT.
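The abstract names two mechanisms: two-level residual quantization (quantize a feature, then quantize the leftover residual with a second codebook) and an orthogonality constraint between codebooks of different scales. The paper's implementation is not shown here; the following is only a minimal NumPy sketch of those two generic ideas, with all function names, sizes, and the random codebooks being illustrative assumptions rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_codeword(x, codebook):
    """Return the index and vector of the codeword closest to x."""
    dists = np.linalg.norm(codebook - x, axis=1)  # distance to every codeword
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

def two_level_residual_quantize(x, cb1, cb2):
    """Quantize x with cb1, then quantize the residual x - q1 with cb2.

    The reconstruction q1 + q2 is finer than either codebook alone,
    which is the capacity gain the abstract attributes to this scheme.
    """
    i1, q1 = nearest_codeword(x, cb1)
    i2, q2 = nearest_codeword(x - q1, cb2)
    return (i1, i2), q1 + q2

def orthogonality_penalty(cb_fine, cb_coarse):
    """Sum of squared inner products between codewords of two scales.

    Driving this toward zero during training pushes the two
    dictionaries toward mutual orthogonality (complementarity).
    """
    gram = cb_fine @ cb_coarse.T
    return float(np.sum(gram ** 2))

# Toy setup: 8-dim features, 16 codewords per (hypothetical) codebook.
d, k = 8, 16
cb1 = rng.standard_normal((k, d))
cb2 = rng.standard_normal((k, d))
x = rng.standard_normal(d)

codes, x_hat = two_level_residual_quantize(x, cb1, cb2)
penalty = orthogonality_penalty(cb1, cb2)
```

In a trained system the penalty term would be added to the retrieval loss so the scales specialize; here it is only evaluated on random codebooks to show the computation.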