Multi-scale Multi-modal Dictionary BERT For Effective Text-image Retrieval in Multimedia Advertising
Tan Yu, Jie Liu, Zhipeng Jin, Yi Yang, Hongliang Fei, Ping Li
Proceedings of the 31st ACM International Conference on Information & Knowledge Management, October 17, 2022
DOI: 10.1145/3511808.3557653
Citations: 2
Abstract
Visual content in multimedia advertising effectively attracts customers' attention. Search-based multimedia advertising is a cross-modal retrieval problem. Due to the modal gap between texts and images/videos, cross-modal image/video retrieval is challenging. Recently, multi-modal dictionary BERT has bridged the modal gap by unifying images/videos and texts from different modalities through a multi-modal dictionary. In this work, we improve multi-modal dictionary BERT by developing a multi-scale multi-modal dictionary, and propose the Multi-scale Multi-modal Dictionary BERT (M^2D-BERT). The multi-scale dictionary partitions the feature space at different levels of granularity and effectively describes both the fine-level and the coarse-level relevance between texts and images. Meanwhile, we constrain the code-words in dictionaries at different scales to be orthogonal to each other, which ensures that the multiple dictionaries are complementary. Moreover, we adopt a two-level residual quantization to enhance the capacity of each multi-modal dictionary. Systematic experiments conducted on large-scale cross-modal retrieval datasets demonstrate the excellent performance of our M^2D-BERT.
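The abstract names two concrete mechanisms: an orthogonality constraint between the code-words of dictionaries at different scales, and a two-level residual quantization within each dictionary. The paper's code is not reproduced here, so the following is a minimal PyTorch sketch of one plausible reading of those two mechanisms; all tensor sizes, function names, and the squared-inner-product form of the orthogonality penalty are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def quantize(x, codebook):
    """Assign each feature in x (N, D) to its nearest code-word in codebook (K, D)."""
    dists = torch.cdist(x, codebook)   # (N, K) pairwise L2 distances
    idx = dists.argmin(dim=1)          # nearest code-word index per feature
    return codebook[idx], idx

def residual_quantize(x, primary_cb, residual_cb):
    """Two-level residual quantization within one dictionary: quantize x
    against the primary code-words, then quantize the leftover residual,
    so the reconstruction is the sum of two code-words."""
    q1, _ = quantize(x, primary_cb)
    q2, _ = quantize(x - q1, residual_cb)
    return q1 + q2

def cross_scale_orthogonality_loss(coarse_cb, fine_cb):
    """Penalize non-orthogonality between code-words of dictionaries at
    different scales (an assumed form of the abstract's constraint):
    zero exactly when every coarse/fine code-word pair is orthogonal."""
    c = F.normalize(coarse_cb, dim=1)
    f = F.normalize(fine_cb, dim=1)
    return (c @ f.t()).pow(2).mean()

# Toy usage with assumed sizes: 256-d features, a 64-word coarse
# dictionary and a 256-word fine dictionary, each with its own
# residual codebook for the second quantization level.
torch.manual_seed(0)
feats = torch.randn(32, 256)
coarse_cb = torch.randn(64, 256, requires_grad=True)
coarse_res = torch.randn(64, 256, requires_grad=True)
fine_cb = torch.randn(256, 256, requires_grad=True)
fine_res = torch.randn(256, 256, requires_grad=True)

recon_coarse = residual_quantize(feats, coarse_cb, coarse_res)
recon_fine = residual_quantize(feats, fine_cb, fine_res)
loss = (F.mse_loss(recon_coarse, feats)
        + F.mse_loss(recon_fine, feats)
        + cross_scale_orthogonality_loss(coarse_cb, fine_cb))
loss.backward()
print(loss.item())

In the full model described by the abstract, such quantized code-words would act as the shared vocabulary through which image and text features enter a BERT-style encoder for relevance scoring; the sketch above illustrates only the dictionary side under those stated assumptions.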