{"title":"KnowER: Knowledge enhancement for efficient text-video retrieval","authors":"Hongwei Kou;Yingyun Yang;Yan Hua","doi":"10.23919/ICN.2023.0009","DOIUrl":null,"url":null,"abstract":"The widespread adoption of mobile Internet and the Internet of things (IoT) has led to a significant increase in the amount of video data. While video data are increasingly important, language and text remain the primary methods of interaction in everyday communication, text-based cross-modal retrieval has become a crucial demand in many applications. Most previous text-video retrieval works utilize implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationship existing in the data, and it cannot assist the model to understand specific words or scenes. Another type of out-of-domain knowledge—explicit knowledge—which is usually in the form of a knowledge graph, can play an auxiliary role in understanding the content of different modalities. Therefore, we study the application of external knowledge base in text-video retrieval model for the first time, and propose KnowER, a model based on knowledge enhancement for efficient text-video retrieval. 
The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets, i.e., MSRVTT, DiDeMo, and MSVD.","PeriodicalId":100681,"journal":{"name":"Intelligent and Converged Networks","volume":"4 2","pages":"93-105"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/9195266/10207889/10208200.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent and Converged Networks","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10208200/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
The widespread adoption of the mobile Internet and the Internet of Things (IoT) has led to a significant increase in the amount of video data. Because video data are increasingly important while language and text remain the primary means of everyday communication, text-based cross-modal retrieval has become a crucial requirement in many applications. Most previous text-video retrieval works exploit the implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationships present in the data; it cannot help the model understand specific words or scenes. Another type of out-of-domain knowledge—explicit knowledge, usually in the form of a knowledge graph—can play an auxiliary role in understanding the content of different modalities. Therefore, we study, for the first time, the application of an external knowledge base to a text-video retrieval model, and propose KnowER, a knowledge-enhanced model for efficient text-video retrieval. The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets, i.e., MSRVTT, DiDeMo, and MSVD.
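At inference time, CLIP-style text-video retrieval of the kind the abstract builds on reduces to ranking candidate videos by the similarity between a text embedding and each video's embedding. The following is a minimal sketch of that ranking step with toy vectors and hypothetical names (`rank_videos`, the example embeddings); it is not the authors' KnowER code and omits the knowledge-enhancement component entirely:

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_videos(text_emb, video_embs):
    # score every candidate video against the text query,
    # return video ids ordered from most to least similar
    scores = {vid: cosine(text_emb, emb) for vid, emb in video_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# toy embeddings standing in for encoder outputs
text_emb = [0.9, 0.1, 0.0]
video_embs = {
    "video_a": [0.8, 0.2, 0.1],  # semantically close to the query
    "video_b": [0.0, 0.9, 0.4],  # unrelated content
}
print(rank_videos(text_emb, video_embs))  # "video_a" ranks first
```

In a real pipeline the embeddings would come from the text and video encoders of a pre-trained model, and the video embedding is typically a pooled aggregate of per-frame features.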