Author: Yufan Xu
Venue: Third International Seminar on Artificial Intelligence, Networking, and Information Technology
Publication date: 2023-02-22
DOI: https://doi.org/10.1117/12.2667669
Video-text cross-modal retrieval algorithm based on multiple coding
Users now have access to a growing volume of video data and to more terminal devices for reaching video resources. Video platforms such as TikTok and YouTube continue to grow in both user base and video inventory, which creates an urgent practical demand for cross-modal retrieval between video and text. This paper proposes a video-text cross-modal retrieval algorithm based on multiple encoding. The global, sequential, and local features of both video and text are encoded, and the encoded features are mapped into a common embedding space, where the loss function is computed and the model is trained and optimized. In experiments on the MSR-VTT dataset, the overall performance metric R@sum improved by 9.22% and 2.86%, respectively, over the existing methods used for comparison, which supports the effectiveness of the proposed method.
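
The abstract reports results as R@sum without defining it. A minimal sketch of how such a score is commonly computed follows. It assumes the usual convention in video-text retrieval, namely that R@sum is the sum of R@1, R@5, and R@10 over both retrieval directions (text-to-video and video-to-text), and that query i matches candidate i. The function names `recall_at_k` and `r_at_sum` and the toy similarity matrix are hypothetical and are not taken from the paper.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the true match: how many candidates score strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def r_at_sum(sim, ks=(1, 5, 10)):
    """Sum of R@k over both retrieval directions (assumed definition)."""
    # Transposing the matrix swaps the roles of queries and candidates.
    transposed = [list(col) for col in zip(*sim)]
    return sum(recall_at_k(sim, k) + recall_at_k(transposed, k) for k in ks)


# Toy data: rows are text queries, columns are videos (made-up similarities).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
print(r_at_sum(toy_sim))
```

Under this definition the maximum possible score is 600, reached when every query retrieves its match at rank 1 in both directions. In practice the similarity matrix would come from cosine similarities between video and text vectors in the common embedding space.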