Video Summarization With Frame Index Vision Transformer

2021 17th International Conference on Machine Vision and Applications (MVA) Pub Date : 2021-07-25 DOI:10.23919/MVA51890.2021.9511350

Tzu-Chun Hsu, Yiping Liao, Chun-Rong Huang


Citations: 2

Abstract

In this paper, we propose a novel frame index vision transformer for video summarization. Given training frames, we linearly project the content of each frame to obtain a frame embedding. By combining the frame embedding with an index embedding and a class embedding, the proposed frame index vision transformer can efficiently and effectively learn the importance of the input frames. Experimental results show that the proposed method outperforms state-of-the-art deep learning methods, including recurrent neural network (RNN) based and convolutional neural network (CNN) based methods, on both the SumMe and TVSum datasets. In addition, our method achieves real-time computational efficiency during testing.
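The abstract names three ingredients of the transformer input: a linearly projected frame embedding, an index embedding, and a class embedding. It does not say how they are combined. The sketch below assumes the usual vision-transformer layout, in which the class embedding is prepended as an extra token and one index embedding is added element-wise to each token. The function names, the toy dimensions, and the random weights are all hypothetical and are not taken from the paper.

```python
import random


def linear_project(frame, weights, bias):
    """Project a flattened frame (length d_in) to an embedding (length d_model)."""
    return [sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, bias)]


def build_tokens(frames, weights, bias, class_embedding, index_embeddings):
    """Assumed ViT-style input: class token first, then one token per frame,
    each with its index embedding added element-wise."""
    tokens = [list(class_embedding)]
    tokens += [linear_project(f, weights, bias) for f in frames]
    if len(tokens) > len(index_embeddings):
        raise ValueError("more tokens than index embeddings")
    return [[t + p for t, p in zip(tok, idx)]
            for tok, idx in zip(tokens, index_embeddings)]


# Toy sizes, chosen only for illustration (not the paper's settings).
rng = random.Random(0)
d_in, d_model, n_frames = 6, 4, 3
frames = [[rng.random() for _ in range(d_in)] for _ in range(n_frames)]
weights = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_model)]
bias = [0.0] * d_model
class_embedding = [rng.gauss(0, 0.1) for _ in range(d_model)]
index_embeddings = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                    for _ in range(n_frames + 1)]

tokens = build_tokens(frames, weights, bias, class_embedding, index_embeddings)
print(len(tokens), len(tokens[0]))  # 4 4
```

In a full model, these tokens would pass through transformer encoder layers and a prediction head that outputs frame importance. That part is omitted here because the abstract gives no details about it.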
