{"title":"SHTVS:基于镜头级的视频摘要分层变压器","authors":"Yubo An, Shenghui Zhao","doi":"10.1145/3512388.3512427","DOIUrl":null,"url":null,"abstract":"In this paper, a Shot-level based Hierarchical Transformer for Video Summarization (SHTVS) is proposed for supervised video summarization. Different from most existing methods that employ bidirectional long short-term memory or use self-attention to replace certain components while keeping their overall structure in place, our methods show that a pure Transformer with video feature sequences as its input can achieve competitive performance in video summarization. In addition, to make better use of the multi-shot characteristic in a video, each video feature sequence is firstly split into shot-level feature sequences with kernel temporal segmentation, and then fed into shot-level Transformer encoder to learn shot-level representations. Finally, shot-level representations and original video feature sequence are integrated for the frame-level Transformer encoder to predict frame-level importance scores. Extensive experimental results on two benchmark datasets (SumMe and TVSum) prove the effectiveness of our methods.","PeriodicalId":434878,"journal":{"name":"Proceedings of the 2022 5th International Conference on Image and Graphics Processing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SHTVS: Shot-level based Hierarchical Transformer for Video Summarization\",\"authors\":\"Yubo An, Shenghui Zhao\",\"doi\":\"10.1145/3512388.3512427\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, a Shot-level based Hierarchical Transformer for Video Summarization (SHTVS) is proposed for supervised video summarization. 
Different from most existing methods that employ bidirectional long short-term memory or use self-attention to replace certain components while keeping their overall structure in place, our methods show that a pure Transformer with video feature sequences as its input can achieve competitive performance in video summarization. In addition, to make better use of the multi-shot characteristic in a video, each video feature sequence is firstly split into shot-level feature sequences with kernel temporal segmentation, and then fed into shot-level Transformer encoder to learn shot-level representations. Finally, shot-level representations and original video feature sequence are integrated for the frame-level Transformer encoder to predict frame-level importance scores. Extensive experimental results on two benchmark datasets (SumMe and TVSum) prove the effectiveness of our methods.\",\"PeriodicalId\":434878,\"journal\":{\"name\":\"Proceedings of the 2022 5th International Conference on Image and Graphics Processing\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 5th International Conference on Image and Graphics Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3512388.3512427\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Image and Graphics 
Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3512388.3512427","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
SHTVS: Shot-level based Hierarchical Transformer for Video Summarization
In this paper, a Shot-level based Hierarchical Transformer for Video Summarization (SHTVS) is proposed for supervised video summarization. Unlike most existing methods, which either employ bidirectional long short-term memory or use self-attention to replace certain components while keeping the overall structure in place, our method shows that a pure Transformer taking video feature sequences as input can achieve competitive performance in video summarization. In addition, to better exploit the multi-shot structure of a video, each video feature sequence is first split into shot-level feature sequences with kernel temporal segmentation and then fed into a shot-level Transformer encoder to learn shot-level representations. Finally, the shot-level representations and the original video feature sequence are combined and passed to a frame-level Transformer encoder, which predicts frame-level importance scores. Extensive experimental results on two benchmark datasets (SumMe and TVSum) demonstrate the effectiveness of our method.
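The two-stage pipeline the abstract describes (shot-level encoding, pooling to shot representations, then frame-level encoding for importance scores) can be sketched as follows. This is purely illustrative: it uses toy dimensions, a bare attention-plus-residual block in place of a full Transformer encoder, untrained stand-in weights instead of the paper's learned parameters, and hand-given shot boundaries instead of kernel temporal segmentation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # single-head scaled dot-product self-attention over frames
    # (projection matrices omitted for brevity)
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def encoder(x):
    # drastically simplified Transformer encoder block:
    # attention + residual, no feed-forward sublayer or LayerNorm
    return x + self_attention(x)

def shtvs_sketch(frames, shot_bounds):
    # shot-level pass: encode each shot independently,
    # mean-pool to one representation per shot
    shot_reps = [encoder(frames[s:e]).mean(axis=0) for s, e in shot_bounds]
    # broadcast each shot representation back to its frames
    shot_ctx = np.zeros_like(frames)
    for (s, e), rep in zip(shot_bounds, shot_reps):
        shot_ctx[s:e] = rep
    # frame-level pass on the fused sequence, then a linear head
    # with a sigmoid to produce per-frame importance scores
    fused = encoder(frames + shot_ctx)
    w = np.ones(frames.shape[1]) / frames.shape[1]  # stand-in for a learned head
    return 1.0 / (1.0 + np.exp(-(fused @ w)))

# toy input: 12 frame features of dimension 8, split into two "shots"
frames = np.random.default_rng(0).normal(size=(12, 8)).astype(np.float32)
scores = shtvs_sketch(frames, [(0, 5), (5, 12)])
print(scores.shape)  # one importance score per frame: (12,)
```

In the actual model, the shot boundaries would come from kernel temporal segmentation and both encoders would be full multi-layer Transformer encoders trained against human-annotated frame importance.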