{"title":"基于改进句子嵌入的多模态视频检索变压器","authors":"Zhi Liu, Fangyuan Zhao, Mengmeng Zhang","doi":"10.1117/12.2643741","DOIUrl":null,"url":null,"abstract":"With the explosive growth of the number of online videos, video retrieval becomes increasingly difficult. Multi-modal visual and language understanding based video-text retrieval is one of the mainstream framework to solve this problem. Among them, MMT (Multi-modal Transformer) is a novel and mainstream model. On the language side, BERT (Bidirectional Encoder Representation for Transformers) is used to encode text, where the pretrained BERT will be fine tuned during training. However, there exists a mismatch in this stage. The pre-training tasks of BERT is based on NSP (Next Sentence Prediction) and MLM(masked language model) which have weak correlation with video retrieval. For text encoder will encode text into semantic embeddings. On the visual side, Transformer is used to aggregate multimodal experts of videos. We find that the output of visual transformer is not fully utilized. In this paper, a sentence- BERT model is introduced to substitute BERT model in MMT to improve sentence embeddings efficiency. In addition, a max-pooling layer is adopted after Transformer to improve the utilization efficiency of the output of the model. Experiment results show that the proposed model outperforms MMT.","PeriodicalId":314555,"journal":{"name":"International Conference on Digital Image Processing","volume":"72 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Multi-modal transformer for video retrieval using improved sentence embeddings\",\"authors\":\"Zhi Liu, Fangyuan Zhao, Mengmeng Zhang\",\"doi\":\"10.1117/12.2643741\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the explosive growth of the number of online videos, video retrieval becomes increasingly difficult. Multi-modal visual and language understanding based video-text retrieval is one of the mainstream framework to solve this problem. Among them, MMT (Multi-modal Transformer) is a novel and mainstream model. On the language side, BERT (Bidirectional Encoder Representation for Transformers) is used to encode text, where the pretrained BERT will be fine tuned during training. However, there exists a mismatch in this stage. The pre-training tasks of BERT is based on NSP (Next Sentence Prediction) and MLM(masked language model) which have weak correlation with video retrieval. For text encoder will encode text into semantic embeddings. On the visual side, Transformer is used to aggregate multimodal experts of videos. We find that the output of visual transformer is not fully utilized. In this paper, a sentence- BERT model is introduced to substitute BERT model in MMT to improve sentence embeddings efficiency. In addition, a max-pooling layer is adopted after Transformer to improve the utilization efficiency of the output of the model. Experiment results show that the proposed model outperforms MMT.\",\"PeriodicalId\":314555,\"journal\":{\"name\":\"International Conference on Digital Image Processing\",\"volume\":\"72 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Digital Image Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1117/12.2643741\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Digital Image Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1117/12.2643741","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
摘要
随着网络视频数量的爆炸式增长,视频检索变得越来越困难。基于多模态视觉和语言理解的视频文本检索是解决这一问题的主流框架之一。其中,MMT (Multi-modal Transformer)是一种新颖的主流模式。在语言方面,BERT(双向编码器表示)用于对文本进行编码,其中预训练的BERT将在训练期间进行微调。然而,这一阶段存在着不匹配。BERT的预训练任务是基于NSP (Next Sentence Prediction)和MLM(mask language model),这两种方法与视频检索的相关性较弱。对于文本,编码器将文本编码为语义嵌入。在视觉方面,Transformer用于聚合视频的多模态专家。我们发现视觉变压器的输出没有得到充分利用。为了提高句子嵌入的效率,本文引入了句子- BERT模型来代替BERT模型。此外,在Transformer之后加入了max-pooling层,提高了模型输出的利用效率。实验结果表明,该模型优于MMT。
Multi-modal transformer for video retrieval using improved sentence embeddings
With the explosive growth of the number of online videos, video retrieval becomes increasingly difficult. Multi-modal visual and language understanding based video-text retrieval is one of the mainstream framework to solve this problem. Among them, MMT (Multi-modal Transformer) is a novel and mainstream model. On the language side, BERT (Bidirectional Encoder Representation for Transformers) is used to encode text, where the pretrained BERT will be fine tuned during training. However, there exists a mismatch in this stage. The pre-training tasks of BERT is based on NSP (Next Sentence Prediction) and MLM(masked language model) which have weak correlation with video retrieval. For text encoder will encode text into semantic embeddings. On the visual side, Transformer is used to aggregate multimodal experts of videos. We find that the output of visual transformer is not fully utilized. In this paper, a sentence- BERT model is introduced to substitute BERT model in MMT to improve sentence embeddings efficiency. In addition, a max-pooling layer is adopted after Transformer to improve the utilization efficiency of the output of the model. Experiment results show that the proposed model outperforms MMT.