KnowER: Knowledge enhancement for efficient text-video retrieval
Hongwei Kou; Yingyun Yang; Yan Hua
Intelligent and Converged Networks, vol. 4, no. 2, pp. 93-105, June 2023
DOI: 10.23919/ICN.2023.0009
Open-access PDF: https://ieeexplore.ieee.org/iel7/9195266/10207889/10208200.pdf
IEEE Xplore record: https://ieeexplore.ieee.org/document/10208200/
The widespread adoption of the mobile Internet and the Internet of Things (IoT) has led to a significant increase in the amount of video data. Although video data are increasingly important, language and text remain the primary means of everyday communication, so text-based cross-modal retrieval has become a crucial requirement in many applications. Most previous text-video retrieval works exploit the implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationships present in the training data and cannot help the model understand specific words or scenes. Another type of out-of-domain knowledge, explicit knowledge, which usually takes the form of a knowledge graph, can play an auxiliary role in understanding the content of different modalities. Therefore, we study the application of an external knowledge base in text-video retrieval models for the first time and propose KnowER, a knowledge-enhanced model for efficient text-video retrieval. The model achieves state-of-the-art performance on three widely used text-video retrieval datasets: MSRVTT, DiDeMo, and MSVD.
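The abstract does not detail KnowER's architecture, but the CLIP-style retrieval setup it builds on can be sketched in a few lines: encode the text query and each video's frames into a shared embedding space, pool the frame embeddings, and rank videos by cosine similarity. The sketch below is illustrative only, not the paper's method; the function names (`rank_videos`, `l2_normalize`), the 4-dimensional toy embeddings, and the choice of temporal mean pooling are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rank_videos(text_emb, video_frame_embs):
    """Rank videos against a text query by cosine similarity.

    text_emb: (d,) embedding of the text query.
    video_frame_embs: list of (num_frames, d) arrays, one per video.
    Returns (ranked video indices, raw similarity scores).
    """
    t = l2_normalize(text_emb)
    scores = []
    for frames in video_frame_embs:
        v = l2_normalize(frames.mean(axis=0))  # temporal mean pooling over frames
        scores.append(float(t @ v))            # cosine similarity in [-1, 1]
    order = [int(i) for i in np.argsort(scores)[::-1]]  # best match first
    return order, scores

# Toy 4-d "embeddings" standing in for real CLIP features.
query = np.array([1.0, 0.0, 0.0, 0.0])
videos = [
    np.tile([0.0, 1.0, 0.0, 0.0], (3, 1)),  # video unrelated to the query
    np.tile([0.9, 0.1, 0.0, 0.0], (3, 1)),  # video closely matching the query
]
order, scores = rank_videos(query, videos)
```

In this toy example the second video's frames point in nearly the same direction as the query, so it is ranked first. Implicit knowledge in this setup lives entirely in the encoder weights that produce the embeddings; an explicit knowledge graph, as the paper proposes, would inject additional structured information before or alongside this similarity step.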