Multi-modal transformer for video retrieval using improved sentence embeddings
Authors: Zhi Liu, Fangyuan Zhao, Mengmeng Zhang
Published in: International Conference on Digital Image Processing, 2022-10-12
DOI: 10.1117/12.2643741 (https://doi.org/10.1117/12.2643741)
Citation count: 1
Abstract
With the explosive growth in the number of online videos, video retrieval is becoming increasingly difficult. Video-text retrieval based on multi-modal visual and language understanding is one of the mainstream frameworks for addressing this problem. Within this family, MMT (Multi-modal Transformer) is a novel and widely used model. On the language side, MMT uses BERT (Bidirectional Encoder Representations from Transformers) to encode text, and the pretrained BERT is fine-tuned during training. However, there is a mismatch at this stage: BERT's pre-training tasks, NSP (Next Sentence Prediction) and MLM (Masked Language Modeling), are only weakly correlated with video retrieval, while the text encoder's role is to encode text into semantic embeddings. On the visual side, a Transformer is used to aggregate the multi-modal experts of a video, and we find that the output of this visual Transformer is not fully utilized. In this paper, a Sentence-BERT model is introduced to replace the BERT model in MMT and improve the efficiency of the sentence embeddings. In addition, a max-pooling layer is added after the Transformer to make better use of the model's output. Experimental results show that the proposed model outperforms MMT.
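
As a rough illustration of the max-pooling idea mentioned in the abstract, the sketch below pools a set of per-expert output vectors element-wise into one video embedding and ranks candidate videos against a sentence embedding. The function names, the toy vectors, and the use of cosine similarity for ranking are assumptions made for illustration; the abstract does not specify how the authors implement these steps.

```python
import math


def max_pool(features):
    """Element-wise maximum over a sequence of equal-length feature vectors."""
    if not features:
        raise ValueError("need at least one feature vector")
    return [max(column) for column in zip(*features)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(text_embedding, video_outputs):
    """Return video ids sorted by descending similarity to the text embedding."""
    scores = {
        video_id: cosine(text_embedding, max_pool(outputs))
        for video_id, outputs in video_outputs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: two videos, each with two transformer output vectors.
video_outputs = {
    "video_a": [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]],
    "video_b": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.7]],
}
# Hypothetical sentence embedding standing in for a text encoder's output.
text_embedding = [1.0, 1.0, 0.0]

ranking = rank_videos(text_embedding, video_outputs)
```

Max-pooling keeps, for each dimension, the strongest response across all output vectors, so information from every output position can contribute to the final video embedding rather than only a single aggregated token.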